MULTI-SAMPLE CLUSTER ANALYSIS AS AN ALTERNATIVE
TO MULTIPLE COMPARISON PROCEDURES


HAMPARSUM  BOZDOGAN 
Department  of  Mathematics 
University  of  Virginia 
Charlottesville,  Virginia 


ABSTRACT 

This paper studies multi-sample cluster analysis, the problem of grouping
samples, as an alternative to multiple comparison procedures. It develops and
introduces model-selection criteria, such as Akaike's Information Criterion
(AIC) and Schwarz's Criterion (SC), as new procedures for comparing means,
groups, or samples, and for identifying and separating the homogeneous groups
or samples from the heterogeneous ones in multi-sample data analysis problems.

An  enumerative  clustering  technique  is  presented  to  generate  all  possible  choices  of 
clustering  alternatives  of  groups,  or  samples  on  the  computer  using  efficient  combinatorial 
algorithms  without  forcing  an  arbitrary  choice  among  the  clustering  alternatives,  and  to  find 
all  sufficiently  simple  groups  or  samples  consistent  with  the  data  and  identify  the  best 
clustering  among  the  alternative  clusterings. 

Numerical  examples  are  carried  out  and  presented  on  a  real  data  set  on  grouping  the 
samples  into  fewer  than  K  groups.  Through  a  Monte  Carlo  study,  an  application  of 
multi-sample  cluster  analysis  is  shown  in  designing  optimal  decision  tree  classifiers  in 
reducing  the  dimensionality  of  remotely  sensed  heterogeneous  data  sets  to  achieve  a 
parsimonious  grouping  of  samples. 

The results obtained demonstrate the utility and versatility of model-selection criteria,
which avoid the notorious choice of levels of significance and which are free from the
ambiguities inherent in the application of conventional hypothesis testing procedures.

KEY WORDS AND PHRASES: Multi-Sample Cluster Analysis; Multiple Comparison
Procedures; Model Selection Criteria; Akaike's Information
Criterion (AIC); Schwarz's Criterion (SC).


1.  INTRODUCTION 


Many  practical  situations  require  the  presentation  of  multivariate  data  from  several 
structured  samples  for  comparative  inference  and  the  grouping  of  the  heterogeneous  samples 
into  homogeneous  sets  of  samples. 

For example, in analysis of variance, to describe any variations of the treatment
means, we partition the treatment means into groups, ideally with the same mean for all
treatments in the same group, to find a more parsimonious grouping of treatments (Cox and
Spjøtvoll 1982).

In remote sensing technology (see, e.g., Argentiero et al. 1982), we classify or group
different samples of high-dimensional, remotely sensed heterogeneous data sets into
homogeneous sets of samples to reduce the dimensionality of these data sets and to design
optimal decision tree classifiers. Decision tree classifiers are popular and useful for
studying the underlying data structure. Samples are subjected to a sequence of decision
rules before they are assigned to a unique class, which helps to identify and determine
the number of types of which the classes might originally have consisted. Such an
approach, provided that it is well designed, gives a classification scheme which is
accurate, flexible, and computationally efficient.

The purpose of this paper is, therefore, to propose and to study Multi-Sample Cluster
Analysis (MSCA), the problem of grouping samples, developed by this author (see, e.g.,
Bozdogan 1981, Bozdogan and Sclove 1984), as an alternative to Multiple Comparison
Procedures (MCP's) through the development and introduction of model-selection criteria
such as those of Akaike (1973, 1974), Akaike (1978), and Schwarz (1978), as new procedures
for the comparison and identification of various collections of groups, samples, treatments,
experimental conditions, or diagnostic classifications, and so forth, in multi-sample data
analysis problems.

In the statistical literature, the Analysis of Variance (ANOVA) is a widely used model
for comparing two or more univariate samples, where the familiar Student's t and F
statistics are used for formal comparisons among two or more samples. In the multi-sample
case the Multivariate Analysis of Variance (MANOVA) is a widely used model for comparing
two or more multivariate samples. In the MANOVA model, the likelihood ratio principle
leads to Wilks' (1932) lambda, or in short Wilks' $\Lambda$ criterion, as the test statistic. It plays
the same role in multivariate analysis that the F-ratio statistic plays in the univariate case.

Often, however, the formal analyses involved in ANOVA or in MANOVA are not
revealing or informative. For this reason, in any problem where a set of parameters is to
be partitioned into groups, it is reasonable to provide a practically useful statistical
procedure or procedures that would use some sort of statistical model to aid in comparisons
of various collections of comparable groups, samples, etc., identify the homogeneous
groups from the heterogeneous ones, or vice versa, and tell us which groups (or samples)
should be clustered together and which should not.

The  object  of  this  paper  is  to  point  out  an  enumerative  clustering  technique  to 
generate  all  possible  choices  of  clustering  alternatives  of  groups  or  samples  on  the 
computer  using  efficient  combinatorial  algorithms  without  forcing  an  arbitrary  choice  among 
the  clustering  alternatives. 

Thus  the  central  idea  is  that  through  Multi-Sample  Cluster  Analysis  (MSCA)  as  an 
alternative  to  Multiple  Comparison  Procedures  (MCP's)  and  through  the  use  of 
model-selection  criteria  we  shall  find  all  sufficiently  simple  partitions  of  groups  or  samples 
consistent  with  the  data  and  identify  the  best  clustering  among  the  alternative  clusterings. 
We achieve this by utilizing a new information-theoretic approach to the multi-sample
conventional tests of homogeneity models discussed in Bozdogan (1984). This approach
unifies the conventional test procedures without the worry of what level of significance $\alpha$
one needs to use. In a conventional pre-test situation, it has become customary to fix the
level of significance a priori at, for example, the 1%, 5%, or 10% level regardless of the
number of parameters estimated within a model. This choice is essentially arbitrary and
has no rational basis. Model-selection criteria adapt
themselves to the number of parameters estimated within a model to achieve parameter
parsimony, and the significance level is adjusted accordingly from one model to the next.

In  Section  2,  we  shall  briefly  discuss  the  Multiple  Comparison  Procedures  (MCP's)  and 
present  their  formulation  in  the  multivariate  case.  Then  we  shall  outline  the  existing 
problems  inherent  with  the  MCP's.  In  Section  3,  we  shall  propose  Multi-Sample  Cluster 
Analysis  (MSCA)  as  an  alternative  to  conventional  Multiple  Comparison  Procedures  (MCP's). 
We  shall  define  the  general  MSCA  problem,  and  discuss  how  to  obtain  the  total  number  of 
clustering  alternatives  for  a  given  K,  the  number  of  groups  or  samples  in  detail  for  both 
MCP's  and  MSCA.  In  the  subsequent  section,  that  is,  in  Section  4,  we  shall  briefly  give 
the  formal  definitions  of  model-selection  criteria  and  present  the  three  most  commonly  used 
multivariate  multi-sample  models,  that  is,  multi-sample  hypotheses,  and  give  their 
model-selection-replacements.  For  more  on  this,  we  refer  the  reader  to  Bozdogan  (1984). 
In  Section  5,  we  shall  give  numerical  examples  on  a  real  data  set,  and  show  an  application 
of  MSCA  in  designing  optimal  decision  tree  classifiers. 

Finally,  in  Section  6,  we  shall  present  our  conclusions,  and  give  a  listing  of  the 
combinatorial  subroutines  in  the  Appendix. 




2.  MULTIPLE  COMPARISON  PROCEDURES  (MCPs) 


In the univariate analysis of variance (ANOVA) model for testing the equality of K
population means, as we mentioned in the introduction of this paper, the test statistic
F = SB/SW is used for comparing several population means. If we compute the value
of F for the sample data, and if it is larger than the critical value of F obtained from
standard F-tables at some prescribed $\alpha$ level, then we reject the overall, "omnibus", null
hypothesis

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_K \tag{2.1}$$

in favor of the alternative hypothesis given by

$$H_1: \text{the } K \text{ population means are not all equal.}$$
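As a minimal sketch of the omnibus test statistic, the following computes the F ratio for a small set of univariate samples. It assumes that SB and SW denote the between-samples and within-samples mean squares; the function name `one_way_f` and the data are illustrative only and are not part of the original paper.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F ratio for a list of univariate samples.

    Assumption: F = SB/SW with SB and SW the between- and within-samples
    mean squares (sums of squares divided by their degrees of freedom).
    """
    K = len(groups)                       # number of samples
    n = sum(len(g) for g in groups)       # total number of observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (K - 1)
    ms_within = ss_within / (n - K)
    return ms_between / ms_within

# Illustrative data: three samples of three observations each.
samples = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_ratio = one_way_f(samples)
```

The value `f_ratio` would then be compared with the tabulated critical value of F with K-1 and n-K degrees of freedom at the chosen level.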


While rejecting the null hypothesis gives us some information about the population
means, namely the heterogeneity of the means, we do not know which means differ from
each other. Hence, neither ANOVA nor MANOVA pinpoints exactly where the significant
differences lie, and an F test alone generally falls short of satisfying all of the practical
requirements involved (Duncan 1955). For example, if K = 3 and $H_0: \mu_1 = \mu_2 = \mu_3$ is
rejected, then we do not know whether the main differences are between $\mu_1$ and $\mu_2$, or
between $\mu_1$ and $\mu_3$, and so on. Therefore, we are faced with many new problems, and
we may ask the following simple and yet important questions: Does $\mu_1$ differ from
$\mu_2$? Does $\mu_1$ differ from $\mu_3$? Which of the samples can be considered as coming from
common populations, and which cannot?

As in the univariate ANOVA model, the same problems arise in the multivariate
analysis of variance (MANOVA) model. That is, rejection of the null hypothesis does
not indicate which groups, samples, or treatments, or any combinations of them, are
different, nor which should be considered as coming from common populations.

Therefore,  it  is  important  to  obtain  some  idea  where  the  differences  in  the  means  or 
mean  vectors  are  when  we  reject  the  null  hypothesis  and  establish  some  relationships  among 
the  unequal  means  or  mean  vectors. 


2.1  Formulation  of  MCP’s 


In the univariate case, i.e., in the case of one response variable, there exists a
multitude of Multiple Comparison Procedures (MCP's) available in the literature. However,
in the multivariate case, even with only two variables, only a few applicable
techniques seem to have been developed for MCP's in practice. The problem of comparing the means
of two Multivariate Normal (MVN) populations, assuming a common covariance matrix $\Sigma$,
can easily be extended to the case of comparing K normal populations when there are $n_g$
independent p-dimensional observations from the gth population.

Following  Seber  (1984,  p.433),  we  now  recapitulate  the  formulation  of  MCP’s  in  the 
multivariate  case. 

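Let $y_{gi}$ be the ith sample observation ($i = 1, 2, \ldots, n_g$) from the gth MVN distribution
$N_p(\mu_g, \Sigma)$ ($g = 1, 2, \ldots, K$), so that we have the following MANOVA model for comparing K
population mean vectors:

$$Y = XB + E. \tag{2.2}$$

The hypothesis of equal population means, $H_0: \mu_1 = \mu_2 = \cdots = \mu_K$, can be written
in the form

$$CB = \begin{bmatrix} 1 & 0 & \cdots & 0 & -1 \\ 0 & 1 & \cdots & 0 & -1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & -1 \end{bmatrix} B = O.$$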
When the hypothesis of equal population means, $H_0$, is rejected, those means, or linear
combinations of them, that led to the rejection of the hypothesis are of interest. We,
therefore, use Roy's (1957) maximum root test to construct simultaneous intervals for the
mean differences since it has the advantage of being linked directly to a set of
simultaneous confidence intervals, although it is not as powerful as the likelihood ratio
test.

For a general linear model we have

$$1 - \alpha = \Pr[\theta_{\max} \le \theta_\alpha]
= \Pr\left[ a'CBb \in a'C\hat{B}b \pm \{\theta_\alpha \, a'C(X'X)^{-1}C'a \cdot b'Wb\}^{1/2} \text{ for all } a, b \right].$$

Thus a set of multiple confidence intervals for all linear combinations $a'CBb$ is given by

$$a'C\hat{B}b \pm \{\theta_\alpha \, a'C(X'X)^{-1}C'a \cdot b'Wb\}^{1/2}, \tag{2.6}$$

and the set has an overall confidence of $100(1-\alpha)\%$, where $\hat{B} = (X'X)^{-1}X'Y$
and $W$ is the "within-groups" or "within-samples" SSP matrix. Applying this to our one
way model in (2.4), we have


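$$CB = \begin{bmatrix} (\mu_1 - \mu_K)' \\ (\mu_2 - \mu_K)' \\ \vdots \\ (\mu_{K-1} - \mu_K)' \end{bmatrix},$$

so that

$$a'CB = a_1(\mu_1 - \mu_K)' + a_2(\mu_2 - \mu_K)' + \cdots + a_{K-1}(\mu_{K-1} - \mu_K)' = \sum_{g=1}^{K} c_g \mu_g',$$

where $c_g = a_g$ ($g = 1, 2, \ldots, K-1$), $c_K = -\sum_{g=1}^{K-1} a_g$, and $\sum_{g=1}^{K} c_g = 0$. We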
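find the class of all contrasts of the $\mu_g$. Furthermore, since in the
classification model $\bar{y}_{g\cdot}$ is the least squares estimate of $\mu_g$, the probability is $1 - \alpha$ that

$$\sum_{g=1}^{K} c_g \mu_g' b \in \sum_{g=1}^{K} c_g \bar{y}_{g\cdot}' b \pm \left\{ \theta_\alpha \left( \sum_{g=1}^{K} \frac{c_g^2}{n_g} \right) b'Wb \right\}^{1/2} \tag{2.8}$$

simultaneously for all contrasts and all b.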

In general, we would be interested in pairwise contrasts $\mu_r - \mu_s$ and the
corresponding subset of (2.8), namely,

$$(\mu_r - \mu_s)'b \in (\bar{y}_{r\cdot} - \bar{y}_{s\cdot})'b \pm \left\{ \theta_\alpha \left( \frac{1}{n_r} + \frac{1}{n_s} \right) b'Wb \right\}^{1/2}. \tag{2.9}$$

If the maximum root test of $H_0: CB = O$ is significant, then at least one of the
intervals given by (2.8) does not contain zero and we can use the simultaneous intervals to
look at any contrast suggested by the data.

Krishnaiah (1965, 1969, 1979), based on a multivariate analogue of Tukey's Studentized
range, proposed a set of simultaneous intervals for all linear combinations
$(\mu_r - \mu_s)'b$. Writing $H_0: \mu_1 = \mu_2 = \cdots = \mu_K$ as the intersection of the pairwise
hypotheses $H_{0rs}$, where $H_{0rs}: \mu_r - \mu_s = 0$, we can test each of the $I = \binom{K}{2} = K(K-1)/2$
hypotheses $H_{0rs}$ using a Hotelling's $T^2$ statistic, $T_k^2$, say, based on a pooled estimate
$S_w = W/\nu$ of $\Sigma$, where $\nu = \sum_{g=1}^{K} (n_g - 1) = n - K$. We can test $H_0$ using

$$T_{\max}^2 = \max_{1 \le k \le I} T_k^2 .$$

If $c_\alpha$ satisfies

$$\Pr[T_{\max}^2 \le c_\alpha \mid H_0] = \Pr[T_k^2 \le c_\alpha,\ k = 1, 2, \ldots, I \mid H_0] = 1 - \alpha,$$

then the probability is $1 - \alpha$ that

$$(\mu_r - \mu_s)'b \in (\bar{y}_{r\cdot} - \bar{y}_{s\cdot})'b \pm \left\{ c_\alpha \left( \frac{1}{n_r} + \frac{1}{n_s} \right) b'Wb \right\}^{1/2}$$

simultaneously for all r, s ($r \ne s$) and for all b. These intervals are the same as (2.9),
except that $\theta_\alpha$ is replaced by $c_\alpha$. Since the intervals of (2.9) are a subset of (2.8), the
overall probability exceeds $1 - \alpha$ and $\theta_\alpha > c_\alpha/\nu$. Unfortunately, extensive tables of $c_\alpha$ are
not available.

If we are interested in just a certain number, say, m, of the elements of B, we can
use the Bonferroni method of constructing m conservative confidence intervals with an
overall confidence of at least $100(1-\alpha)\%$, and a t-value $t_{\nu}^{\alpha/(2m)}$.
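The arithmetic behind the Bonferroni method can be sketched as follows. This is only an illustration under the usual Bonferroni inequality; the function name `bonferroni_levels` and the values of alpha and m are hypothetical, and the t-value itself would still be read from tables at the tail area returned here.

```python
def bonferroni_levels(alpha, m):
    """Levels used when m intervals share an overall error rate of at most alpha."""
    per_interval = alpha / m              # each interval has confidence 1 - alpha/m
    tail_area = alpha / (2 * m)           # upper-tail area for the two-sided t-value
    overall_lower_bound = 1 - m * per_interval   # Bonferroni bound on joint coverage
    return per_interval, tail_area, overall_lower_bound

# Illustrative values: six intervals at an overall level of 5%.
per_interval, tail_area, bound = bonferroni_levels(alpha=0.05, m=6)
```

The bound is conservative: the joint coverage is at least `bound`, and typically larger, which is why the intervals widen as m grows.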

For more details on MCP's, we refer the reader to Duncan (1955), Gabriel (1964,
1968), Miller (1966, 1981), O'Neill and Wetherill (1971), Thomas (1973), Spjøtvoll (1974),
Seber (1984), and many other authors, since the literature is quite rich in this area.


2.2  Problems  With  MCP's 


While many MCP's have been proposed in many different papers in the univariate
case, including the ones referred to above, these procedures unfortunately still have
some serious drawbacks, and only a few MCP's are operational in practice in the
multivariate case.

The  major  problems  with  MCP's  in  general  can  briefly  be  summarized  as  follows. 

(i) MCP's either reject or accept the hypothesis of homogeneity, that is, equality of
means, or equivalently, an MCP declares each set of means as heterogeneous (rejected) or
homogeneous (accepted). The MCP decision rule is not transitive; i.e., model $M_1$ may be
preferred to $M_2$, $M_2$ to $M_3$, and $M_3$ to $M_1$, etc.

(ii) The decision to accept or reject a model depends on a given significance level $\alpha$
chosen to maximize the power of the test. In an MCP, it is not clear how the level of the test
should be defined, and it is not clear how to control the overall error rate. Also, it is not
clear what should be optimized.

(iii) Running all $\binom{K}{2} = K(K-1)/2$ pairwise MCP's increases the number of null
hypotheses to be tested. Even if all the null hypotheses are actually true, we become
more likely to reject at least one of them, thus increasing the probability of incorrectly
rejecting at least one $H_0$.

(iv) Existing MCP's in general are all devised to handle pairwise comparisons. They
need to be extended to handle all k-subsets of a K-set hypothesis.

(v) In MCP's, arbitrary assumptions are made on the parameters of the models. For
example, in the formulation of MCP's in Section 2.1, for the MANOVA model we assumed a
common dispersion matrix $\Sigma$. For unequal $\Sigma_g$, i.e., for covariance heterogeneity, Olson
(1974), from a large scale simulation study, reported a high inflation in Type I error and
excessive rejections of $H_0$ on the basis of Roy's maximum root test, $\theta_{\max}$. In this
respect the other two statistics, Wilks' $\Lambda$ and Hotelling's $T^2$, behave like $\theta_{\max}$.
Therefore, it might be expected that in the presence of covariance heterogeneity, MCP's
might also give erroneous results.

(vi)  In  the  multivariate  literature  there  does  not  exist  any  simple  MCP  to  handle  the 
case  where  both  mean  vectors  and  the  covariance  matrices  in  a  model  might  vary  and  we 
still  want  to  carry  out  comparative  simultaneous  inference.  The  same  holds  in  the  case  of 
complete  homogeneity,  that  is,  when  data  are  assumed  to  have  come  from  identical 
populations  and  we  still  want  to  carry  out  comparative  simultaneous  inference. 
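Problem (iii) above can be illustrated numerically. The sketch below counts the pairwise tests for K samples and evaluates the probability of at least one false rejection when every null hypothesis is true. It assumes, purely for illustration, that the tests are independent and each is run at the same level; pairwise comparisons on shared samples are not independent in practice, so the numbers indicate the trend only. The function names are hypothetical.

```python
from math import comb

def pairwise_test_count(K):
    """Number of pairwise comparisons among K samples, K(K-1)/2."""
    return comb(K, 2)

def familywise_error_rate(alpha, m):
    """P(at least one false rejection) for m independent tests at level alpha.

    Assumption: independence of the m tests, which is a simplification.
    """
    return 1.0 - (1.0 - alpha) ** m

# How the error rate grows with the number of samples at a 5% per-test level.
for K in (3, 5, 10):
    m = pairwise_test_count(K)
    print(K, m, round(familywise_error_rate(0.05, m), 3))
```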

Clearly, there are many problems connected with the existing MCP's. For
this reason, in the next section, we shall introduce and utilize a general methodology called
Multi-Sample Cluster Analysis (MSCA) as an alternative to Multiple Comparison Procedures
(MCP's).  MSCA  depends  on  fast  and  efficient  combinatorial  algorithms.  The  analysis  is 


3.  MULTI-SAMPLE  CLUSTER  ANALYSIS  AS  AN  ALTERNATIVE  TO  MULTIPLE 
COMPARISON  PROCEDURES 

The  problem  of  MCP's  can  be  viewed  as  one  of  clustering  of  means,  groups,  samples, 
or  treatments.  The  possibility  of  using  cluster  analysis  in  place  of  an  MCP  appears  to  be 
originally  suggested  by  Plackett  in  his  discussion  of  the  review  paper  by  O’Neill  and 
Wetherill  (1971). 

In the literature, Scott and Knott (1974) used a cluster analysis approach for grouping
means; Cox and Spjøtvoll (1982) used simple partitioning of means into groups based on the
standard F statistic, to mention a few. Their procedures are similar in spirit to ours,
but our method differs in its general formulation and is new. Therefore, here we shall
propose MSCA, or what Gower (1981) calls "K-Group Classification", or equivalently what
we also call "K-Sample Cluster Analysis", as an alternative to MCP's.

Next,  we  discuss  the  general  MSCA  problem. 

3.1  The  Multi-Sample  Cluster  (MSC)  Problem 

The  problem  of  Multi-Sample  Cluster  Analysis  (MSCA)  arises  when  we  are  given  a 
collection  of  groups,  profiles,  samples,  treatments,  etc.,  whether  these  are  formed  naturally 
or  experimentally,  and  our  goal  is  to  cluster  these  into  homogeneous  groups.  Thus  the 
problem  here  is  to  cluster  "groups"  or  "samples"  rather  than  "individuals”  or  "objects"  as  in 
the  single-sample  case. 

Suppose each individual, object, or case, has been measured on p response or outcome
measures (dependent variables) simultaneously in K independent groups or samples (factor
levels). Let

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_K \end{bmatrix}$$

be a single data matrix of K groups or samples, where $X_g$ $(n_g \times p)$ represents the
observations from the gth group or sample, $g = 1, 2, \ldots, K$, and $n = \sum_{g=1}^{K} n_g$. The goal of
cluster analysis is to put the K groups or samples into k homogeneous groups, samples, or
classes where k is unknown and varying, but $k \le K$.

Thus we seek the smallest number k such that the data are consistent with k groups;
from a robustness viewpoint of test statistics, k should generally be as small as
possible. Surprisingly, robustness properties do not always improve with increasing group
size: small groups are preferred to achieve a parsimonious grouping of samples, and in
general to reduce the dimensionality of multi-sample data sets.

We  generate  all  possible  clustering  alternatives  of  groups  or  samples  on  the  computer 
using  efficient  combinatorial  algorithms  and  we  assemble  information  from  all  the  different 
groupings  of  size  K  without  forcing  an  arbitrary  choice  among  the  clustering  alternatives. 

We  next  discuss  how  to  obtain  the  total  number  of  clustering  alternatives  for  a  given 
K,  the  number  of  groups  or  samples. 

3.2  Determining  the  Number  of  Clustering  Alternatives 

Let K be the number of samples, and let k be the number of clusters of samples. If
we use MCP's, and if all pairwise comparisons among the K groups were desired, then this
would require in general $\binom{K}{2} = K(K-1)/2$ tests. On the other hand, if we consider
the combinations of K groups or samples taken k at a time, where $k \le K$, then there are
$\binom{K}{k}$ k-subsets of a K-set altogether. In Appendix A.1, we give a simple algorithm called
MCP which constructs all the possible alternatives sequentially in "lexicographic", i.e.,
"alphabetical", order. A listing of the output from the subroutine MCP is shown in Table
3.1.
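The enumeration of k-subsets in lexicographic order can be sketched with the Python standard library. This is not the MCP subroutine of Appendix A.1, only an illustration of the same listing; the function name `k_subsets` and the labelling of samples as 1, ..., K are assumptions.

```python
from itertools import combinations
from math import comb

def k_subsets(K, k):
    """All k-subsets of the samples 1..K, listed in lexicographic order."""
    # combinations() emits tuples in lexicographic order of the sorted input.
    return list(combinations(range(1, K + 1), k))

# All pairwise comparisons among K = 4 samples.
pairs = k_subsets(4, 2)
for subset in pairs:
    print(subset)
```

For any K and k, the length of the returned list agrees with the binomial coefficient `comb(K, k)`.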

We note that the existing conventional MCP's are not devised to handle the case of
$\binom{K}{k}$, i.e., k-subsets of a K-set, hypotheses or tests for k > 2. They all need to be modified
accordingly.

If we use the complete enumeration technique, then the number of clusterings of K
groups or samples into k nonempty clusters of samples is given by the following theorem.

Theorem 3.2.1. The number of ways of clustering K samples or groups into k-sample
clusters where $k \le K$ such that none of the k-sample clusters is empty is given by

$$\sum_{g=0}^{k} \binom{k}{g} (-1)^g (k-g)^K, \tag{3.2}$$

when the order of samples (or groups) within each cluster is irrelevant.

Proof. Duran and Odell (1974, p.26).

In this theorem the k-sample clusters are assumed to be distinct. However, in
clustering or partitioning K samples into k subsets, none of which is empty, the order of
k-sample clusters or k-subsets is irrelevant. Consequently, from this fact and Theorem
3.2.1, it follows that the total number of ways of clustering K samples into k-sample
clusters (or subsets) is given by

$$w = S(K,k) = \frac{1}{k!} \sum_{g=0}^{k} (-1)^g \binom{k}{g} (k-g)^K, \tag{3.3}$$

which is known as the Stirling Number of the Second Kind, which gives us the number of
clustering alternatives.
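Expressions (3.2) and (3.3) can be checked directly with a few lines of code. The sketch below evaluates the alternating sum for distinct clusters and then divides by k!, the number of orderings of the k clusters; the function names are illustrative and are not those of the Appendix.

```python
from math import comb, factorial

def labelled_clusterings(K, k):
    """Expression (3.2): clusterings of K samples into k distinct nonempty clusters."""
    return sum((-1) ** g * comb(k, g) * (k - g) ** K for g in range(k + 1))

def stirling2(K, k):
    """Expression (3.3): divide by k! because the order of the clusters is irrelevant."""
    return labelled_clusterings(K, k) // factorial(k)

# K = 4 samples into k = 1, 2, 3, 4 clusters.
counts_for_4 = [stirling2(4, k) for k in range(1, 5)]
```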

If k, the number of clusters of samples, is known in advance, then the total number
of clustering alternatives is given by S(K,k). However, if k is not specified a priori and is
unknown, but $k \le K$, then the total number of clustering alternatives is given by

$$\sum_{k=1}^{K} S(K,k). \tag{3.4}$$

S(K,k) can be written in terms of the recursive formula

$$S(K,k) = k\,S(K-1,k) + S(K-1,k-1), \quad \text{with } S(1,1) = 1, \tag{3.5}$$

and $S(1,k) = 0$ for $k \ne 1$, and $S(K,2) = 2^{K-1} - 1$.

For detailed explanations and proofs, see, e.g., Duran and Odell (1974), and Späth
(1980). Table 3.2 gives S(K,k) for values of K and k up to 10, generated by the
subroutine STIRN2 in Appendix A.2. This subroutine constructs a table of the total number of
clustering alternatives for various values of K, the number of samples, and k, the varying
number of clusters of samples.
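A table of this kind can be built from the recursion (3.5) alone. The following is a sketch in the spirit of STIRN2, not a transcription of it; the function name `stirling2_table` and the zero-padded list layout are assumptions made for the illustration.

```python
def stirling2_table(K_max):
    """S[K][k] for 1 <= k <= K <= K_max via recursion (3.5); unused cells stay 0."""
    S = [[0] * (K_max + 1) for _ in range(K_max + 1)]
    S[1][1] = 1                                   # boundary condition S(1,1) = 1
    for K in range(2, K_max + 1):
        for k in range(1, K + 1):
            S[K][k] = k * S[K - 1][k] + S[K - 1][k - 1]
    return S

table = stirling2_table(10)
row_4 = table[4][1:5]       # S(4,1), ..., S(4,4)
total_4 = sum(row_4)        # expression (3.4) for K = 4

# Print the triangular table, one row per K.
for K in range(1, 11):
    print(K, table[K][1:K + 1])
```

Summing a row gives the total number of clustering alternatives in (3.4) for that K.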

Consider, for example, K = 4 samples. We now wish to cluster K = 4 groups or
samples first into k = 1 group or sample, then into k = 2, k = 3, and k = 4 groups or
samples, in a hierarchical fashion. In order to generate all possible clustering
alternatives, we utilize Table 3.2. The total
number of ways of clustering K = 4 groups or samples into k = 1 homogeneous group or
sample is 1. The total number of ways of clustering K = 4 groups or samples into k = 2
homogeneous groups or samples is 7. The total number of ways of clustering K = 4 groups
or samples into k = 3 homogeneous groups or samples is 6, and finally, the total number of
ways of clustering K = 4 groups or samples into k = 4 homogeneous groups or samples is
1. Adding up these results, we obtain 15 clustering alternatives in total
for K = 4 groups or samples into k = 1, 2, 3, and 4 homogeneous groups. We note that 15
is nothing but the sum of the values of row 4 in Table 3.2.

In general, clustering alternatives can be classified according to their representation
forms to make it easy to list all possible clustering alternatives. The subroutine REPFM in
Appendix A.3 gives the partitions of K (number of samples), which is a positive integer, into
a specified number k of parts. For example, the representation forms of K = 4 groups or
samples into k = 1, 2, 3, and 4 groups or samples are:

4 = {4}
  = {3} + {1}
  = {2} + {2}
  = {2} + {1} + {1}
  = {1} + {1} + {1} + {1},

where each of the components {g} in a representation denotes the number g of groups or
samples in the corresponding cluster. The components of a representation form will always
be written in a hierarchical order to depict the patterns of clustering alternatives. In our
example there are 15 clustering alternatives but only 5 representation forms. In general
the number of representation forms is much smaller than the number of clustering
alternatives.
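The representation forms are the partitions of the integer K into exactly k parts, and they can be generated recursively. The sketch below is one possible way to do so and is not the subroutine of Appendix A.3; the function name `representation_forms` and its `largest` argument are hypothetical.

```python
def representation_forms(K, k, largest=None):
    """Partitions of the integer K into exactly k parts, parts in non-increasing order."""
    if largest is None:
        largest = K
    if k == 0:
        return [[]] if K == 0 else []
    forms = []
    # The first part must leave at least 1 for each of the remaining k - 1 parts
    # and may not exceed the part chosen before it.
    for first in range(min(largest, K - (k - 1)), 0, -1):
        for rest in representation_forms(K - first, k - 1, first):
            forms.append([first] + rest)
    return forms

# All representation forms of K = 4 samples for k = 1, 2, 3, 4.
forms_for_4 = [form for k in range(1, 5) for form in representation_forms(4, k)]
for form in forms_for_4:
    print(" + ".join("{%d}" % g for g in form))
```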

To  generate  and  list  the  clustering  alternatives  corresponding  to  their  representation 
forms,  we  use  the  subroutine  ALLSUB  given  in  Appendix  A.4.  This  subroutine  generates 
and  lists  all  the  simple  patterns  of  clustering  alternatives  for  a  specified  number  of  samples 
K for Multi-Sample Cluster Analysis (MSCA). For example, Table 3.3 gives a simple pattern
of  clustering  alternatives  when  K  =  3  and  K  =  4  groups  or  samples,  and  we  wish  to  cluster 
them  into  k  =  1,2,3  and  k  =  1,2,3,  and  4  homogeneous  groups,  respectively. 

Looking at Table 3.3 for K = 4 groups or samples, we see that, in alternative one,
groups or samples 1, 2, 3, and 4 are all clustered together. In terms of a hypothesis on
means, this corresponds to $\mu_1 = \mu_2 = \mu_3 = \mu_4$, all being equal, indicating that
groups 1, 2, 3, and 4 are all homogeneous or identical. On the other hand, in
alternative fifteen, groups or samples 1, 2, 3, and 4 are clustered as singletons. In terms
of a hypothesis on means, this corresponds to $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$ all being different,
and therefore we have 4-sample clusters, indicating that groups 1, 2, 3, and 4 are
all heterogeneous. In a similar fashion, we interpret the other clustering alternatives
continuing down the line of Table 3.3.
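The complete list of clustering alternatives for a small K can be produced by a short recursive generator. This sketch is not the ALLSUB subroutine of Appendix A.4, and the order in which it lists the alternatives with the same number of clusters need not agree with Table 3.3; the function name `clustering_alternatives` is an assumption.

```python
def clustering_alternatives(samples):
    """Yield every partition of `samples` into nonempty clusters (lists of lists)."""
    if not samples:
        yield []
        return
    first, rest = samples[0], samples[1:]
    for partial in clustering_alternatives(rest):
        # Put `first` into each existing cluster in turn ...
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # ... or let it start a new cluster of its own.
        yield [[first]] + partial

# All alternatives for K = 4, ordered by the number of clusters k.
alternatives = sorted(clustering_alternatives([1, 2, 3, 4]), key=len)
for number, alternative in enumerate(alternatives, start=1):
    print(number, alternative)
```

With this ordering the first alternative places all four samples in one cluster and the last one lists them as singletons, matching the interpretation of alternatives one and fifteen given above.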

In concluding this section, we see that in general the total number of ways of
clustering K groups or samples into k homogeneous groups or samples is given by equation
(3.3), and the total number of possible clustering alternatives is given by the expression
(3.4). Furthermore, the listings of the necessary combinatorial subroutines are presented in
the Appendix.

Having discussed how to determine the number of clustering alternatives, we might
ask further questions which need to be answered, as follows.

(i) How do we identify the best fitting or approximating model?

(ii) Which clustering alternative do we choose?

(iii) Is it fair to compare different models at the same risk level?

(iv) Should we assume common or varying variance-covariance matrices in
clustering samples?

(v) How do we interpret the results?, and so on.

3.3  Splitting  Algorithm  for  Multi-Sample  Cluster  Analysis  (MSCA) 


When the number of samples to be clustered is more than K = 10 groups or
samples, to save computer time and cost of computation, we use the following Splitting
Algorithm to search for an optimal clustering alternative for k = 1, 2, ..., 10 groups or
samples, stage-wise.


STAGE-1 : Start with the k = 1-Sample Cluster, that is, with all the groups or samples
together in a single cluster, and compute the AIC and SC.

STAGE-2 : The K samples in the root node are split into k = 2-Sample Clusters by using the
Stirling Number of the Second Kind (STIRN2) subroutine. The AIC's and SC's
are computed for all the clustering alternatives, and the alternative with the
minimum value of AIC or SC is chosen as the best one to be split next.

STAGE-3 : The best clustering alternative in STAGE-2, based on the value of the criteria,
is now split into k = 3-Sample Clusters by STIRN2, and the AIC's and SC's
are computed to choose the best k = 3-Sample Clusters.

STAGE-4 : The process in STAGE-3 is repeated until all the groups are clustered in their
own singleton clusters.


In this manner, the Splitting Algorithm moves from one optimal stage to the next
instead of generating all possible clustering alternatives at once and then searching for the
best clustering alternative as k (the number of clusters of samples) varies. Generating all
alternatives at once requires enormous storage space on the computer and is prohibitive for
large K, although it is not impossible. Our approach has an advantage over the Binary
Splitting Algorithms used in the literature, since one can see and construct the stage-wise
optimal decision trees as one walks through the algorithm.
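
The stage-wise search can be sketched in Python as follows. This is only one reading of the
stages above, under the assumption that each stage splits a single cluster of the current
best alternative into two non-empty parts and keeps the split with the smallest criterion
value. The names `two_way_splits` and `splitting_search` are hypothetical, the criterion is
passed in as a function (AIC or SC in the text), and the toy criterion in the demonstration
is made up purely for illustration.

```python
from itertools import combinations
from typing import Callable, Iterator, List, Sequence, Tuple

Cluster = Tuple[int, ...]
Partition = Tuple[Cluster, ...]


def two_way_splits(cluster: Sequence[int]) -> Iterator[Tuple[Cluster, Cluster]]:
    """Yield every split of a cluster into two non-empty parts, each split once."""
    first, rest = cluster[0], tuple(cluster[1:])
    # Keeping the first member on the left avoids mirror-image duplicates.
    for size in range(len(rest)):
        for chosen in combinations(rest, size):
            left = (first,) + chosen
            right = tuple(x for x in rest if x not in chosen)
            yield left, right


def splitting_search(samples: Sequence[int],
                     criterion: Callable[[Partition], float]
                     ) -> List[Tuple[Partition, float]]:
    """Return the stage-wise best alternative for k = 1, 2, ..., K clusters."""
    current: Partition = (tuple(samples),)
    path = [(current, criterion(current))]
    while any(len(cluster) > 1 for cluster in current):
        candidates = []
        for idx, cluster in enumerate(current):
            for left, right in two_way_splits(cluster):
                candidates.append(current[:idx] + (left, right) + current[idx + 1:])
        current = min(candidates, key=criterion)  # smallest criterion value wins
        path.append((current, criterion(current)))
    return path


if __name__ == "__main__":
    # Made-up sample means; the toy criterion is within-cluster spread plus 2 per cluster.
    means = {1: 0.0, 2: 5.0, 3: 0.2, 4: 5.1}

    def toy_criterion(partition: Partition) -> float:
        spread = 0.0
        for cluster in partition:
            centre = sum(means[g] for g in cluster) / len(cluster)
            spread += sum((means[g] - centre) ** 2 for g in cluster)
        return spread + 2.0 * len(partition)

    for partition, value in splitting_search([1, 2, 3, 4], toy_criterion):
        print(len(partition), partition, round(value, 3))
```

Because each stage only examines splits of the current best alternative, the search visits
far fewer alternatives than complete enumeration, at the price of not guaranteeing the
global optimum at every k.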

In the next section, Section 4, we shall present our proposed new approach, namely
model-selection criteria such as Akaike's Information Criterion (AIC) and Schwarz's Criterion
(SC) (see also Akaike 1978), as new procedures for comparisons and identification of groups
or samples under three different but linked multivariate models, and give their
AIC-replacements.


4.  MODEL-SELECTION  CRITERIA  AND  MULTIVARIATE  MODELS 


4.1  Model-Selection  Criteria 

The  "classical"  or  "conventional"  approach  to  the  model  selection  problem  has  its 
basic  roots  in  statistical  hypothesis  testing  problems.  Hypothesis  testing  problems  are 
always  based  on  the  assumption  that  available  data  are  actually  generated  from  one  type  of 
model  with  a  known  structure,  and  the  goal  is  to  select  this  model  by  analyzing  the  given 
data  set. 

In contrast, in recent years the literature has placed more and more emphasis
on model selection criteria or procedures. The necessity of introducing the concept of
model selection or model identification has been recognized, and the problem is posed as the
choice of the "best" approximating model among a class of competing models by a suitable
model selection criterion, given a data set. Model selection criteria are figures of merit for
competing models. The model that optimizes the criterion is chosen to be the best
model.

Suppose there are K alternative models $M_k$, k = 1, 2, ..., K, represented by the densities
$f_1(x \mid \theta_1), f_2(x \mid \theta_2), \ldots, f_K(x \mid \theta_K)$ for the explanation
of a random vector x given n observations, and suppose we wish to identify, compare, and
choose among the models $\{M_k : k \in K\}$, which have different numbers of parameters.
Akaike, in a pioneering and very important sequence of papers, including Akaike (1973, 1974,
1977, 1981), developed a model selection criterion for the identification of an optimal and
parsimonious model in data analysis from a class of models, which takes model complexity
into account. His approach is based on the Kullback-Leibler Information (KLIC) and the
asymptotic properties of the maximum likelihood ratio statistic. The AIC statistic is an
estimator of the risk of a model under maximum likelihood estimation, and it is defined as
follows.

Definition 4.1.1 Let $\{M_k : k \in K\}$ be a set of competing models indexed by
k = 1, 2, ..., K. Then, the criterion

    $\mathrm{AIC}(k) = -2\log_e L[\hat\theta(k)] + 2m(k)$    (4.1)

which is minimized to choose a model $M_k$ over the set of models is called Akaike's
Information Criterion (AIC).

In (4.1), $L[\hat\theta(k)]$ is the likelihood function of the observations,
$\hat\theta(k)$ is the maximum likelihood estimate of the parameter vector $\theta$ under the
model $M_k$, and m(k) is the number of independent parameters estimated when $M_k$ is the
model. According to AIC, inclusion of an additional parameter is appropriate if
$\log_e L[\hat\theta(k)]$ increases by one unit or more, i.e., if $L[\hat\theta(k)]$
increases by a factor of e ($\approx 2.718$) or more.

Recently, Akaike (1977, 1978) and Schwarz (1978) have developed a new model-selection
criterion, called BIC, which we will denote here by SC, and which is defined as follows.

Definition 4.1.2 Let $\{M_k : k \in K\}$ be a set of competing models indexed by
k = 1, 2, ..., K. Then, the criterion

    $\mathrm{SC}(k) = -2\log_e L[\hat\theta(k)] + m(k)\log_e n$    (4.2)

which is minimized to choose a model $M_k$ over the set of models is called Schwarz's
Criterion (SC).

The criterion (4.2) is derived from a first-order approximation to the posterior
probability of $M_k$ over a set of models. We note that this is similar to Akaike's Bayesian
Information Criterion (BIC), developed by Akaike (1977, 1978), in terms of its dependence
on $\log_e n$.

According to Schwarz's Criterion (SC), an additional parameter will be included only if
it increases $\log_e L[\hat\theta(k)]$ by an amount greater than $\log_e(n)/2$, that is, if
$L[\hat\theta(k)]$ increases by a factor of the square root of n or more. Since $\log_e n$
exceeds 2 for n of 8 or more ($e^2 \approx 7.39$), Schwarz's Criterion favors lower
dimensional models than AIC does, as does Akaike's BIC. For large sample sizes, AIC and SC
differ from one another in the manner in which they adjust the usual likelihood ratio
statistic, taking into account the difference in dimensionality between the models.
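
The two penalties can be compared directly with a small Python sketch of (4.1) and (4.2).
The function names are hypothetical, and the numerical inputs in the demonstration are
made-up values chosen only to show the threshold behaviour described above.

```python
import math


def aic(max_log_likelihood: float, n_params: int) -> float:
    """Akaike's Information Criterion, equation (4.1)."""
    return -2.0 * max_log_likelihood + 2.0 * n_params


def sc(max_log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Schwarz's Criterion, equation (4.2)."""
    return -2.0 * max_log_likelihood + n_params * math.log(n_obs)


def extra_parameter_lowers(log_lik_gain: float, n_obs: int) -> dict:
    """Does one extra parameter strictly lower each criterion for a given log-likelihood gain?"""
    return {
        "AIC": log_lik_gain > 1.0,                   # penalty of 2 per parameter
        "SC": log_lik_gain > math.log(n_obs) / 2.0,  # penalty of log(n) per parameter
    }


if __name__ == "__main__":
    # Made-up inputs: maximized log likelihood -80 with 11 parameters and n = 20.
    print(aic(-80.0, 11), sc(-80.0, 11, 20))
    # A gain of 1.2 in log likelihood is enough for AIC but not for SC when n = 20.
    print(extra_parameter_lowers(1.2, 20))
```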

In the literature, there exist other Akaike-type model-selection criteria, which can be
generalized and put into what we call the Generalized Information Criterion (GIC), defined by

    $\mathrm{GIC}(k) = -2\log_e L[\hat\theta(k)] + a(n)\,m(k) + b(k)$,    (4.3)

where n is the sample size, $\log_e = \ln$ denotes the natural logarithm, $L[\hat\theta(k)]$
denotes the maximum of the likelihood over the parameters, and m(k) is the number of
independent parameters in the k-th model. For a given criterion, a(n) is the cost of fitting
an additional parameter and b(k) is an additional term depending upon the criterion and the
model k. For example, Kashyap's (1982) Criterion (KC) falls under the expression for GIC
given in (4.3). Kashyap's Criterion (KC) is based on reasoning similar to BIC and SC, but
contains an extra term, and it could be expected to perform better. However, it is not
convenient to use in applications, especially in the type of problems we are looking at in
this paper. In KC, the extra term is $b(k) = \log_e[\det B(k)]$, where det denotes the
determinant and B(k) is the negative of the matrix of second partials of
$\log_e L[\theta(k)]$, evaluated at the maximum likelihood estimates. For our purposes, it is
therefore prohibitively expensive to compute KC and its extra term. For this reason, we have
chosen to work only with AIC and SC, which are sufficient for all practical purposes, and we
do not pursue Kashyap's Criterion (KC) here.

We next derive the forms of AIC only, for three linked but different multivariate
models, for the convenience of the reader. For more details on this, we refer the reader
to Bozdogan (1981, 1984) and Bozdogan and Sclove (1984). Derivations of the SC's follow
similarly.

4.2  Multivariate  Models  and  Their  AIC’s 

Throughout this section we shall suppose that we may have independent data matrices
$X_1, X_2, \ldots, X_K$, where the rows of $X_g$ ($n_g \times p$) are independent and
identically distributed (i.i.d.) $N_p(\mu_g, \Sigma_g)$, g = 1, 2, ..., K. In terms of the
parameters $\theta = (\mu_1, \mu_2, \ldots, \mu_K, \Sigma_1, \Sigma_2, \ldots, \Sigma_K)$,
the models we are going to consider are as follows:

(i) $\theta = (\mu_1, \mu_2, \ldots, \mu_K, \Sigma_1, \Sigma_2, \ldots, \Sigma_K)$ [m = kp + kp(p+1)/2 parameters]

(ii) $\theta = (\mu_1, \mu_2, \ldots, \mu_K, \Sigma, \Sigma, \ldots, \Sigma)$ [m = kp + p(p+1)/2 parameters]

(iii) $\theta = (\mu, \mu, \ldots, \mu, \Sigma, \Sigma, \ldots, \Sigma)$ [m = p + p(p+1)/2 parameters].

In this section we shall derive the forms of AIC for these models. Recall the
definition of AIC in (4.1):

    $\mathrm{AIC} = -2\log_e L(\hat\theta) + 2m = -2\log_e(\text{maximized likelihood}) + 2m$,

where m denotes the number of free parameters within the model.

4.2.1 AIC for the Test of Homogeneity of Covariances Model:
$\mathrm{AIC}(\{\mu_g, \Sigma_g\}) \equiv$ AIC (varying $\mu$ and $\Sigma$)

Consider K normal populations with different mean vectors $\mu_g$ and different
covariance matrices $\Sigma_g$, g = 1, 2, ..., K. Let $X_{gi}$, i = 1, 2, ..., $n_g$, be a
random sample of observations from the g-th population $N_p(\mu_g, \Sigma_g)$.

Now, we derive the form of Akaike's Information Criterion (AIC) to test the
hypothesis that the covariance matrices of these populations are equal. The likelihood
function of all the sample observations is given by

    $L(\{\mu_g, \Sigma_g\}; X) = \prod_{g=1}^{K} L_g(\mu_g, \Sigma_g; X_g)$,

or by

    $L = (2\pi)^{-np/2} \prod_{g=1}^{K} |\Sigma_g|^{-n_g/2}
         \exp\{ -\tfrac{1}{2}\,\mathrm{tr} \sum_{g=1}^{K} \Sigma_g^{-1} A_g
                -\tfrac{1}{2}\,\mathrm{tr} \sum_{g=1}^{K} n_g \Sigma_g^{-1}
                 (\bar X_g - \mu_g)(\bar X_g - \mu_g)' \}$,

where $n = \sum_{g=1}^{K} n_g$ and
$A_g = \sum_{i=1}^{n_g} (X_{gi} - \bar X_g)(X_{gi} - \bar X_g)'$.

The log likelihood function is

    $l(\{\mu_g, \Sigma_g\}; X) = \log_e L(\{\mu_g, \Sigma_g\}; X)$
    $= -(np/2)\log_e(2\pi) - \tfrac{1}{2} \sum_{g=1}^{K} n_g \log_e |\Sigma_g|
       - \tfrac{1}{2}\,\mathrm{tr} \sum_{g=1}^{K} \Sigma_g^{-1} A_g
       - \tfrac{1}{2}\,\mathrm{tr} \sum_{g=1}^{K} n_g \Sigma_g^{-1}
         (\bar X_g - \mu_g)(\bar X_g - \mu_g)'$.    (4.5)

The maximum likelihood estimates (MLE's) of $\mu_g$ and $\Sigma_g$ are

    $\hat\mu_g = \bar X_g$,    (4.6)

    $\hat\Sigma_g = A_g / n_g$,  g = 1, 2, ..., K.    (4.7)

Since the K populations are independent, the likelihood of all the sample observations is
simply the product of the separate likelihoods, and so maximizing (4.5) is equivalent to
maximizing the individual sample likelihoods separately. This yields the MLE's given
in (4.6) and (4.7) above.

Substituting the MLE's into (4.5) and simplifying, the maximized log likelihood becomes

    $l(\{\hat\mu_g, \hat\Sigma_g\}; X) = \log_e L(\{\hat\mu_g, \hat\Sigma_g\}; X)$
    $= -(np/2)\log_e(2\pi) - \tfrac{1}{2} \sum_{g=1}^{K} n_g \log_e |n_g^{-1} A_g| - (np/2)$.    (4.8)

Since

    $\mathrm{AIC} = -2\log_e L(\hat\theta) + 2m$,    (4.9)

where m = kp + kp(p+1)/2 is the number of parameters, the AIC becomes

    $\mathrm{AIC}(\{\mu_g, \Sigma_g\}) \equiv$ AIC (varying $\mu$ and $\Sigma$)
    $= np\log_e(2\pi) + \sum_{g=1}^{K} n_g \log_e |n_g^{-1} A_g| + np + 2[kp + kp(p+1)/2]$.    (4.10)


Since the constants do not affect the result of comparison of models, we could ignore
them and reduce the form of AIC to a much simpler form

    $\mathrm{AIC}(\{\mu_g, \Sigma_g\}) \equiv$ AIC (varying $\mu$ and $\Sigma$)
    $= \sum_{g=1}^{K} n_g \log_e |A_g| + 2[kp + kp(p+1)/2]$,    (4.11)

where

    $n_g$ = sample size of group or sample g = 1, 2, ..., K,
    $|A_g|$ = the determinant of the sum of squares and cross-products (SSCP) matrix
              for group or sample g = 1, 2, ..., K,
    k = number of groups or samples compared, and
    p = number of variables.

However, for purposes of comparison we retain the constants and use AIC given by (4.10).


4.2.2 AIC for the Multivariate Analysis of Variance (MANOVA) Model:
$\mathrm{AIC}(\{\mu_g, \Sigma\}) \equiv$ AIC (varying $\mu$ and common $\Sigma$)

Consider in this case K normal populations with different mean vectors $\mu_g$,
g = 1, 2, ..., K, but each population is assumed to have the same covariance matrix
$\Sigma$. Let $X_{gi}$, g = 1, 2, ..., K; i = 1, 2, ..., $n_g$, be a random sample of
observations from the g-th population $N_p(\mu_g, \Sigma)$.

To derive Akaike's Information Criterion (AIC) in this case, we use the log likelihood
function given in (4.5). Since each population is assumed to have the same covariance
matrix $\Sigma$, the log likelihood function becomes

    $l(\{\mu_g\}, \Sigma; X) = \log_e L(\{\mu_g\}, \Sigma; X)$
    $= -(np/2)\log_e(2\pi) - (n/2)\log_e |\Sigma|
       - \tfrac{1}{2}\,\mathrm{tr}\, \Sigma^{-1} \sum_{g=1}^{K} A_g
       - \tfrac{1}{2}\,\mathrm{tr}\, \Sigma^{-1} \sum_{g=1}^{K} n_g
         (\bar X_g - \mu_g)(\bar X_g - \mu_g)'$,    (4.12)

and the maximum likelihood estimates (MLE's) of $\mu_g$ and $\Sigma$ are

    $\hat\mu_g = \bar X_g$,  g = 1, 2, ..., K,    (4.13)

and

    $\hat\Sigma = n^{-1} W$,    (4.14)

where

    $W = \sum_{g=1}^{K} A_g$.


Substituting these back into (4.12) and simplifying, the maximized log likelihood
becomes

    $l(\{\hat\mu_g\}, \hat\Sigma; X) = \log_e L(\{\hat\mu_g\}, \hat\Sigma; X)$
    $= -(np/2)\log_e(2\pi) - (n/2)\log_e |n^{-1} W| - (np/2)$,    (4.15)

where W is the "within-groups" SSP matrix.

Since

    $\mathrm{AIC} = -2\log_e L(\hat\theta) + 2m$,    (4.16)

where m = kp + p(p+1)/2 is the number of parameters, then AIC becomes

    $\mathrm{AIC}(\{\mu_g, \Sigma\}) \equiv$ AIC (varying $\mu$ and common $\Sigma$)
    $= np\log_e(2\pi) + n\log_e |n^{-1} W| + np + 2[kp + p(p+1)/2]$.    (4.17)

Since the constants do not affect the result of comparison of models, we could ignore
them and reduce the form of AIC to a much simpler form

    $\mathrm{AIC}(\{\mu_g, \Sigma\}) \equiv$ AIC (varying $\mu$ and common $\Sigma$)
    $= n\log_e |W| + 2[kp + p(p+1)/2]$,    (4.18)

where

    $n = \sum_{g=1}^{K} n_g$ = the total sample size,
    $|W|$ = the determinant of the "within-groups" SSP matrix,
    k = number of groups or samples compared,
    p = number of variables.

However, for purposes of comparison we retain the constants and use AIC given by (4.17).


4.2.3 AIC for the Test of Complete Homogeneity Model:
$\mathrm{AIC}(\{\mu, \Sigma\}) \equiv$ AIC (common $\mu$ and $\Sigma$)

Consider again K normal populations with the same mean vector $\mu$ and the same
covariance matrix $\Sigma$. To derive the form of AIC for the test of complete homogeneity
model, we set all the $\mu_g$'s equal to $\mu$ and all the $\Sigma_g$'s equal to $\Sigma$
in the log likelihood (4.5) of Section 4.2.1, and obtain the log likelihood function, which
is given by

    $l \equiv \log_e L(\mu, \Sigma; X)$
    $= -(np/2)\log_e(2\pi) - (n/2)\log_e |\Sigma|
       - \tfrac{1}{2}\,\mathrm{tr}\, \Sigma^{-1} (W + B)
       - (n/2)\,\mathrm{tr}\, \Sigma^{-1} (\bar X - \mu)(\bar X - \mu)'$.    (4.19)

The MLE's of $\mu$ and $\Sigma$ are

    $\hat\mu = \bar X$    (4.20)

and

    $\hat\Sigma = (1/n)(W + B) = T/n$.    (4.21)

Substituting these back into (4.19), we have the maximized log likelihood

    $l(\hat\mu, \hat\Sigma) \equiv \log_e L(\hat\mu, \hat\Sigma; X)$
    $= -(np/2)\log_e(2\pi) - (n/2)\log_e |n^{-1} T| - (np/2)$.    (4.22)

Thus, using the equation of AIC in (4.9) again, where m = p + p(p+1)/2 is the number
of parameters this time, the AIC becomes

    $\mathrm{AIC}(\{\mu, \Sigma\}) \equiv$ AIC (common $\mu$ and $\Sigma$)
    $= np\log_e(2\pi) + n\log_e |n^{-1} T| + np + 2[p + p(p+1)/2]$.    (4.23)

After ignoring the constants, AIC takes the simplified form

    $\mathrm{AIC}(\{\mu, \Sigma\}) \equiv$ AIC (common $\mu$ and $\Sigma$)
    $= n\log_e |T| + 2[p + p(p+1)/2]$,    (4.24)

where $|T|$ = the determinant of the "total" SSP matrix. However, for purposes of
comparison we retain the constants and use AIC given in (4.23).
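
The three closed forms (4.10), (4.17), and (4.23) can be computed directly from grouped data.
The following Python sketch uses only the standard library; it is not the AICPARM program,
the helper names are hypothetical, and the small bivariate data set in the demonstration is
made up for illustration. It assumes every group has more observations than variables, so
that each $A_g$ is positive definite.

```python
import math
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def sscp(rows: Sequence[Sequence[float]]) -> Matrix:
    """Sum of squares and cross-products matrix about the sample mean."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows)
             for b in range(p)] for a in range(p)]


def log_det(matrix: Matrix) -> float:
    """log|M| of a symmetric positive definite matrix via its Cholesky factor."""
    p = len(matrix)
    chol = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][m] * chol[j][m] for m in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(p))


def scaled(matrix: Matrix, factor: float) -> Matrix:
    """Multiply every entry of a matrix by a scalar."""
    return [[factor * v for v in row] for row in matrix]


def aic_three_models(groups: Sequence[Sequence[Sequence[float]]]) -> Dict[str, float]:
    """AIC with constants retained, following (4.10), (4.17) and (4.23)."""
    k, p = len(groups), len(groups[0][0])
    n = sum(len(g) for g in groups)
    const = n * p * math.log(2.0 * math.pi) + n * p
    cov_params = p * (p + 1) / 2
    a_mats = [sscp(g) for g in groups]                      # A_g for each group
    w = [[sum(a[i][j] for a in a_mats) for j in range(p)]   # W = sum of the A_g
         for i in range(p)]
    t = sscp([row for g in groups for row in g])            # T about the grand mean
    varying = (const
               + sum(len(g) * log_det(scaled(a, 1.0 / len(g)))
                     for g, a in zip(groups, a_mats))
               + 2 * (k * p + k * cov_params))
    manova = const + n * log_det(scaled(w, 1.0 / n)) + 2 * (k * p + cov_params)
    homogeneous = const + n * log_det(scaled(t, 1.0 / n)) + 2 * (p + cov_params)
    return {"varying mu, varying Sigma": varying,
            "varying mu, common Sigma": manova,
            "common mu, common Sigma": homogeneous}


if __name__ == "__main__":
    # Made-up bivariate data: three groups of four observations each.
    demo = [
        [[1.0, 2.0], [2.0, 1.5], [1.5, 3.0], [2.5, 2.5]],
        [[4.0, 2.2], [5.0, 1.0], [4.5, 3.1], [5.5, 2.0]],
        [[1.2, 2.1], [2.1, 1.4], [1.4, 2.8], [2.6, 2.7]],
    ]
    for name, value in aic_three_models(demo).items():
        print(f"{name}: {value:.3f}")
```

The model with the smallest of the three values is the one selected by the minimum AIC
procedure.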

4.3  AIC-Replacements  for  Multi-Sample  Conventional  Tests  of  Homogeneity  Models 


Having derived in Section 4.2 the exact analytical forms of Akaike's Information
Criterion (AIC) for each of the multivariate models, we shall give in this section the
AIC-replacements for the multivariate multi-sample conventional tests of homogeneity models
and establish the relationship of the AIC-replacements to the conventional procedures.
For more details on this, we refer the reader to Bozdogan (1984).

We next state the following very important theorem, which we shall utilize in
establishing the relationships between the AIC-replacements and the conventional procedures.

Theorem 4.3.1 If $\Omega$ is a parameter space in $R^K$, and if $\Omega_0$ is a
k-dimensional subspace of $\Omega$, then under suitable regularity conditions, for each
$\theta \in \Omega_0$, $-2\log\lambda$ has an asymptotic $\chi^2_{K-k}$ distribution as
$n \to \infty$.

Proof. See, for example, Wilks (1932), and Silvey (1970, p. 113).

We are now in a position to give the AIC-replacements for the multivariate
multi-sample conventional tests of homogeneity models.

4.3.1  AIC-Replacement  for  Box's  M  for  Testing  Homogeneity  of  Covariances 

As an alternative to Box's M test for testing the equality of covariance matrices, for
which extensive tables are not readily available, we may summarize the condition for
rejecting

    $H_{0a} : \Sigma_1 = \Sigma_2 = \cdots = \Sigma_K$    (4.25)

against

    $H_{1a}$ : Not all K covariance matrices are equal,

as follows:

Relation 4.3.1. We reject $H_{0a}$ (test of homogeneity of covariances) if

    $\mathrm{AIC}(\{\mu_g, \Sigma\}) > \mathrm{AIC}(\{\mu_g, \Sigma_g\})$,    (4.26)

that is, if

    $\Delta\mathrm{AIC}(H_{0b}; H_{0a}) = \mathrm{AIC}(\{\mu_g, \Sigma\}) - \mathrm{AIC}(\{\mu_g, \Sigma_g\}) > 0$    (4.27)

    iff  $n\log_e |n^{-1} W| - \sum_{g=1}^{K} n_g \log_e |n_g^{-1} A_g| > (k-1)p(p+1)$    (4.28)

    iff  $-2\log_e \lambda_{0a} > (k-1)p(p+1)$,    (4.29)

where $\mathrm{AIC}(\{\mu_g, \Sigma\})$ is given in (4.17) and
$\mathrm{AIC}(\{\mu_g, \Sigma_g\})$ is given in (4.10), and where $-2\log_e \lambda_{0a}$ has
an asymptotic chi-squared distribution with (1/2)(k-1)p(p+1) degrees of freedom by
Theorem 4.3.1. Using this fact, we establish the following:

Relation 4.3.2. For comparing pairs of models,

    $\chi^2 = \mathrm{AIC}(\{\mu_g, \Sigma\}) - \mathrm{AIC}(\{\mu_g, \Sigma_g\}) + 2[\tfrac{1}{2}(k-1)p(p+1)]$,    (4.30)

where $\chi^2$ is tested as a chi-square with degrees of freedom d.f. = (1/2)(k-1)p(p+1).


4.3.2 AIC-Replacement for Wilks' $\Lambda$ Criterion for Testing the Equality of Mean Vectors

As an alternative to Wilks' $\Lambda$ Criterion, Bartlett's V statistic, and other
conventional procedures for testing the equality of mean vectors given a common covariance
matrix between the groups or samples, we may summarize the condition for rejecting

    $H_{0b} : \mu_1 = \mu_2 = \cdots = \mu_K$    (4.31)

against

    $H_{1b}$ : Not all $\mu_g$'s are equal,

as follows:

Relation 4.3.3. We reject $H_{0b}$ (one-way multivariate analysis of variance hypothesis) if

    $\mathrm{AIC}(\{\mu, \Sigma\}) > \mathrm{AIC}(\{\mu_g, \Sigma\})$,    (4.32)

that is, if

    $\Delta\mathrm{AIC}(H_{0c}; H_{0b}) = \mathrm{AIC}(\{\mu, \Sigma\}) - \mathrm{AIC}(\{\mu_g, \Sigma\}) > 0$    (4.33)

    iff  $n\log_e |n^{-1} T| - n\log_e |n^{-1} W| > 2p(k-1)$    (4.34)

    iff  $-2\log_e \lambda_{0b} > 2p(k-1)$,    (4.35)

because this test is done under the assumption of a common $\Sigma$.

Here $\mathrm{AIC}(\{\mu_g, \Sigma\})$ is given in (4.17) and $\mathrm{AIC}(\{\mu, \Sigma\})$
is given in (4.23), and $-2\log_e \lambda_{0b}$ has an asymptotic chi-squared distribution
with p(k-1) degrees of freedom by Theorem 4.3.1. Again, using this fact, we establish the
following:

Relation 4.3.4. For comparing pairs of models,

    $\chi^2 = \mathrm{AIC}(\{\mu, \Sigma\}) - \mathrm{AIC}(\{\mu_g, \Sigma\}) + 2[p(k-1)]$,    (4.36)

where $\chi^2$ is tested as a chi-square with degrees of freedom d.f. = p(k-1).

4.3.3  AIC-Replacement  for  Testing  Complete  Homogeneity 

Combining the results in Sections 4.3.1 and 4.3.2, we may summarize the condition for
rejecting

    $H_{0c} : \mu_1 = \mu_2 = \cdots = \mu_K$ and $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_K$    (4.37)

against

    $H_{1c}$ : Not all K mean vectors and covariance matrices are equal,

as follows:

Relation 4.3.5. We reject $H_{0c}$ (test of complete homogeneity) if

    $\mathrm{AIC}(\{\mu, \Sigma\}) > \mathrm{AIC}(\{\mu_g, \Sigma\})$    (4.38)

and

    $\mathrm{AIC}(\{\mu_g, \Sigma\}) > \mathrm{AIC}(\{\mu_g, \Sigma_g\})$,

or if

    $\mathrm{AIC}(\{\mu, \Sigma\}) > \mathrm{AIC}(\{\mu_g, \Sigma_g\})$,    (4.39)

since $\lambda_{0a}\lambda_{0b} = \lambda_{0c}$. That is, reject $H_{0c}$ if

    $\Delta\mathrm{AIC}(H_{0c}; H_{0a}) = \mathrm{AIC}(\{\mu, \Sigma\}) - \mathrm{AIC}(\{\mu_g, \Sigma_g\}) > 0$    (4.40)

    iff  $n\log_e |n^{-1} T| - \sum_{g=1}^{K} n_g \log_e |n_g^{-1} A_g| > p(p+3)(k-1)$    (4.41)

    iff  $-2\log_e \lambda_{0c} > p(p+3)(k-1)$,    (4.42)

where $\mathrm{AIC}(\{\mu_g, \Sigma_g\})$ is given in (4.10), $\mathrm{AIC}(\{\mu_g, \Sigma\})$
is given in (4.17), and $\mathrm{AIC}(\{\mu, \Sigma\})$ is given in (4.23).

We note that $-2\log_e \lambda_{0c}$ has an asymptotic chi-squared distribution with
(1/2)p(p+3)(k-1) degrees of freedom by Theorem 4.3.1. Thus, we now establish our final
relation as follows:

Relation 4.3.6. For comparing pairs of models,

    $\chi^2 = \mathrm{AIC}(\{\mu, \Sigma\}) - \mathrm{AIC}(\{\mu_g, \Sigma_g\}) + 2[\tfrac{1}{2}p(p+3)(k-1)]$,    (4.43)

where $\chi^2$ is tested as a chi-square with degrees of freedom d.f. = (1/2)p(p+3)(k-1).
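
Relations 4.3.2, 4.3.4, and 4.3.6 share one pattern: the chi-square statistic is the AIC
difference plus twice the degrees of freedom. A minimal Python sketch of that pattern
follows; the function names are hypothetical, and the demonstration uses the AIC values
quoted in (5.1) and (5.2) of Section 5.1 with K = 4 and p = 2.

```python
def df_homogeneity_of_covariances(k: int, p: int) -> int:
    """Degrees of freedom in Relation 4.3.2: (1/2)(k-1)p(p+1)."""
    return (k - 1) * p * (p + 1) // 2


def df_equality_of_means(k: int, p: int) -> int:
    """Degrees of freedom in Relation 4.3.4: p(k-1)."""
    return p * (k - 1)


def df_complete_homogeneity(k: int, p: int) -> int:
    """Degrees of freedom in Relation 4.3.6: (1/2)p(p+3)(k-1)."""
    return p * (p + 3) * (k - 1) // 2


def chi_square_from_aic(aic_null: float, aic_alternative: float, df: int) -> float:
    """Chi-square statistic recovered from two AIC values: AIC(null) - AIC(alt) + 2 df."""
    return aic_null - aic_alternative + 2 * df


def reject_by_aic(aic_null: float, aic_alternative: float) -> bool:
    """Minimum AIC rule: reject the null model when its AIC is the larger one."""
    return aic_null > aic_alternative


if __name__ == "__main__":
    k, p = 4, 2
    print(df_homogeneity_of_covariances(k, p),
          df_equality_of_means(k, p),
          df_complete_homogeneity(k, p))
    # Homogeneity of covariances: null = common Sigma (5.2), alternative = varying Sigma (5.1).
    df = df_homogeneity_of_covariances(k, p)
    print(reject_by_aic(178.290, 186.324), chi_square_from_aic(178.290, 186.324, df))
```

The degrees of freedom of the first two tests add up to those of the third, which mirrors
the way the complete homogeneity test combines the other two.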
5. NUMERICAL EXAMPLES

In this section we shall give two different numerical examples and study Multi-Sample
Cluster Analysis (MSCA) as an alternative to Multiple Comparison Procedures (MCP's). In
Section 5.1, we shall study tests of homogeneity from the model-selection viewpoint for the
varieties of rice data set given in Srivastava and Carter (1983, p. 100), where the detailed
conventional analysis of this data set is discussed and treated. In Section 5.2, we shall
show the application of MSCA in designing optimal decision tree classifiers, which are
popular in remote sensing technology.

5.1  Multi-Sample  Clustering  of  Varieties  of  Rice 

Suppose four varieties of rice (see, e.g., Srivastava and Carter, 1983), namely varieties
A, B, C, and D, are sown in 20 plots, where each variety of rice is assigned at random to
five plots. Two variables were measured six weeks after transplanting: $X_1$, the height of
the plant, and $X_2$, the number of tillers per plant. Thus for this data set we have p = 2
characteristics, $n_g = 5$, g = 1, 2, 3, 4, and $n = \sum n_g = 20$.

We next study tests of homogeneity for this data set by using our procedure, show the
step-by-step analysis, and compare our results with those of the conventional tests.

(i)  Identification  of  the  Best  Fitting  Parametric  Model: 

We present the summary of the AIC-values under the three parametric multivariate
normal models as follows:

    $\mathrm{AIC}(\{\mu_g, \Sigma_g\}) \equiv$ AIC (varying $\mu$ and $\Sigma$) = 186.324    (5.1)

    $\mathrm{AIC}(\{\mu_g, \Sigma\}) \equiv$ AIC (varying $\mu$ and common $\Sigma$) = 178.290    (5.2)

    $\mathrm{AIC}(\{\mu, \Sigma\}) \equiv$ AIC (common $\mu$ and $\Sigma$) = 185.440    (5.3)


The minimum AIC occurs under the MANOVA model in (5.2). Therefore, according to
the definition of AIC, the MANOVA model is the best fitting model for the analysis of the
varieties of rice data set. In other words, we are accepting the equality of covariance
matrices for this data set. In fact, if we perform a conventional multivariate test for the
homogeneity of covariance matrices, we obtain Box's M = 7.97272 or $\chi^2_9 = 6.173$ with
P-value = .722 (approximately). Hence, accepting the hypothesis of homogeneity of
covariance matrices clears the way for a test on the homogeneity of the variety mean
vectors, which is the MANOVA null hypothesis. As we saw above, the minimum AIC
procedure already picked the MANOVA model as the best fitting model for this data set.

(ii)  Test  of  Homogeneity  of  Mean  Vectors: 

Having determined the best fitting model, that is, the MANOVA model, we now test
the null hypothesis

    $H_{0b} : \mu(A) = \mu(B) = \mu(C) = \mu(D)$

against the alternative hypothesis which negates $H_{0b}$. Using Relation 4.3.3, since

    $\mathrm{AIC}(\{\mu, \Sigma\}) = 185.440 > \mathrm{AIC}(\{\mu_g, \Sigma\}) = 178.290$,    (5.4)

we reject $H_{0b}$, and claim that there is a difference in varieties.
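
As a worked check, Relation 4.3.4 can be applied to the two values in (5.4) with K = 4 and
p = 2. The statistic below is obtained by direct substitution rather than taken from a
reported table, and it is the uncorrected asymptotic likelihood ratio statistic.

```latex
\begin{align*}
\chi^2 &= \mathrm{AIC}(\{\mu,\Sigma\}) - \mathrm{AIC}(\{\mu_g,\Sigma\}) + 2\,p(k-1) \\
       &= 185.440 - 178.290 + 2(2)(4-1) \\
       &= 7.150 + 12 = 19.150, \qquad \text{d.f.} = p(k-1) = 6 .
\end{align*}
```

Since 19.150 exceeds the upper 5% point of a chi-square with 6 degrees of freedom (12.59),
the conventional reading of the statistic agrees with the minimum AIC decision to reject
$H_{0b}$.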

(iii)  Multiple  Comparisons  Under  the  Best  Fitting  Model: 

Now we need to compare the four varieties of rice simultaneously under the best fitting
model, that is, under $\{\mu_g, \Sigma\}$ in terms of the parameters. For this we use
$\mathrm{AIC}(\{\mu_g, \Sigma\})$ to compare the four varieties of rice pairwise. Our
results are presented in Table 5.1.

Looking at Table 5.1, we see that, using all the variables simultaneously, the first
minimum AIC occurs at the alternative submodel 2, where we have (A,C) as one
homogeneous pair. The second best homogeneous pair is (B,D). We never choose the pair
(A,B), that is, submodel 1, since its AIC value is quite large, indicating the inferiority of
this submodel, or indicating that there is a difference between varieties A and B and that
they should not be put together as one homogeneous group.

Although the pairwise comparison is the most commonly used Multiple Comparison
Procedure (MCP) in the literature, it is neither general nor informative. It considers only
the variabilities in pairs of groups or samples, and it ignores the variabilities in other
groups. For this reason, we shall next propose our new methodology, that is,
Multi-Sample Cluster Analysis (MSCA), as an alternative to Multiple Comparison Procedures
(MCP's).


(iv)  Multi-Sample  Cluster  Analysis  (MSCA)  of  Varieties: 


We now cluster K = 4 samples (varieties of rice) into k = 1, 2, 3, and 4-Sample
Clusters on the basis of all the variables, where p = 2 in this case. We obtain in total
fifteen possible clustering alternatives by using the STIRN2 subroutine in Appendix A.2.
Using statistical computer software newly developed by this author, called AICPARM: A
General Purpose Program for Computing AIC's and SC's for Univariate and Multivariate Normal
Parametric Models, and using the MANOVA model as our best fitting model, we obtained
the results given in Table 5.2.
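
The fifteen alternatives can be listed explicitly with a short recursive enumeration. The
Python sketch below is not the STIRN2 subroutine; the function names are hypothetical, and
the order in which it lists the alternatives is not claimed to match the submodel numbering
of Table 5.2.

```python
from typing import Dict, Iterator, List, Sequence


def set_partitions(items: Sequence[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of items into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # Put `first` into each existing cluster in turn ...
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # ... or give it a cluster of its own.
        yield [[first]] + partial


def alternatives_by_k(items: Sequence[str]) -> Dict[int, List[List[List[str]]]]:
    """Group all clustering alternatives by their number of clusters k."""
    grouped: Dict[int, List[List[List[str]]]] = {}
    for partition in set_partitions(list(items)):
        grouped.setdefault(len(partition), []).append(partition)
    return dict(sorted(grouped.items()))


if __name__ == "__main__":
    for k, partitions in alternatives_by_k("ABCD").items():
        for partition in partitions:
            print(k, " ".join("(" + ",".join(cluster) + ")" for cluster in partition))
```

Each listed alternative would then be scored with AIC and SC under the chosen model, as is
done for Table 5.2.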

Looking at Table 5.2, we see that the minimum AIC and SC clustering occurs at
alternative submodel 7, that is, the k = 2-Sample Clusters (1,3) (2,4) = (A,C) (B,D),
indicating that there seem to be two types of varieties of rice rather than four. The
second minimum AIC and SC occur at alternative submodel 13, at k = 3-Sample
Clusters, where we have (1,3) (2) (4) = (A,C) (B) (D) as our clustering. This tells us that
if we were to cluster only two of the varieties of rice, we should cluster varieties A and C
together as one homogeneous group, and we should keep varieties B and D completely
separate. We note that larger values of AIC and SC indicate the inferiority of the
submodels. Furthermore, we can see the effect of clustering each variety
by looking at the differences of AIC's and SC's across each clustering alternative.
According to AIC and SC, the most inferior submodel is 8, where we have
(1,2) (3,4) = (A,B) (C,D) as our clustering.

In comparing our results in Tables 5.1 and 5.2, we see that Multi-Sample Cluster
Analysis (MSCA) is much more general and informative than the pairwise Multiple
Comparison Procedures (MCP's) for simultaneous comparative inference.

(v) Determining the Variables Contributing Most to the Differences in Varieties:

Since there is heterogeneity in the mean vectors (or locations) of the four varieties
of rice, we further proceed on the basis of univariate theory to study the behaviour of the
variety data on each of the p = 2 variables. Our results are given in Table 5.3.

Interpreting the results in Table 5.3, we note that $X_2$ = number of tillers per plant
shows significant homogeneity across the four varieties of rice and, in fact, is the best
variable according to the minimum AIC value. The first variable, that is, $X_1$ = height of
plant, indicates on the basis of the AIC value that there is a difference in heights between
the varieties. The general conclusion is that there exists more heterogeneity in means on
variable $X_1$ than on $X_2$ across the four varieties of rice.


5.2  Application  of  Multi-Sample  Cluster  Analysis  in  Designing  Decision  Trees  in  Remote 
Sensing 


The decision tree classifier has been widely used in various problems in geoscience and
remote sensing, speech analysis, biomedical applications, and many other areas. For more on
this, we refer the reader to Argentiero et al. (1982), Kulkarni and Kanal (1976), Mui and Fu
(1980), Wang and Suen (1984), and others.

A decision tree classifier has an advantage over a single stage classifier in the sense
that a complex global decision can be made via a series of simple and local decisions. This
makes a decision tree classifier useful in two main respects:

(i) recognition of pattern classes, and

(ii) speed, since a tree classifier can make a decision much more quickly than a single
stage classifier.

For example, in remote sensing problems one is faced with an image (or scene) which
is a rectangular array with I rows (scan lines) and J columns (the number of resolution
elements per scan line), where each cell is one resolution element (an individual). Each
cell (individual or pixel) generates a $p \times 1$ measurement vector $X_{ij}$,
i = 1, 2, ..., I, and j = 1, 2, ..., J. We denote the features by

    $x_1, x_2, \ldots, x_p$.

The vector feature is

    $x = (x_1, x_2, \ldots, x_p)$.

The observed digital image is

    $\{x_{ij} : i = 1, 2, \ldots, I,\; j = 1, 2, \ldots, J\}$,

where

    $x_{ij} = (x_{1ij}, x_{2ij}, \ldots, x_{pij})$

is the vector of numerical values of the p features at pixel (i,j). For more on this, see
also Sclove (1982).

In order to recognize an image (or scene), we need to perform classification, that is,
grouping of pixels, to check the homogeneity of large dimensional Multispectral Scanner
(MSS) data sets with a view toward identifying objects, recognizing the pattern classes,
and so forth. This is the major task of cluster analysis techniques.

After  the  features  are  extracted,  a  decision  rule  is  then  applied  to  assign  the 
reduced  dimensional  samples  to  available  classes  by  merging  and  subjecting  these  samples  to 
a  sequence  of  decision  rules  before  they  are  assigned  to  a  unique  class.  Such  an  approach 
further  reduces  the  dimensionality  of  these  large  dimensional  data  sets,  and  it  results  in  an 
optimal  decision  tree  classifier  which  is  computationally  efficient,  accurate,  and  flexible. 
Argentiero et al. (1982) give an example of how to design an optimal decision tree
classifier by using a conventional statistical procedure, namely the multivariate F-test, and
give the associated table look-up decision tree classifier on a simulated heterogeneous
multivariate data set where both the mean vectors and the covariance matrices among the
five classes are varying. It seems that such an approach is primitive, and the decision rule
at each stage depends upon a given significance level $\alpha$. Also, it is not clear how
they controlled the overall error rate in their study. We cannot simply use the usual
F-tables in
the  presence  of  covariance  heterogeneity  without  testing  the  equality  of  covariance 
matrices. 

To provide an example of Multi-Sample Cluster Analysis (MSCA) for the classification
of large dimensional data sets arising from the merging of remote sensing data, we
reconstructed the data structure presented in Argentiero et al. (1982) with different sample
sizes. That is, we simulated 100 different p = 4 variate multivariate normal samples from
the K = 5 classes using the IMSL procedure GGNSM. The simulated data were based on
the class statistics given in Table 5.4, which were obtained from a Landsat-2 satellite over
a midwestern county. The five classes consisted of two types of winter wheat and
three confusion crops, or non-wheat crops. The four channels, that is, p = 4, are those of
the Multispectral Scanner on board Landsat-2. The numbers of observations in each
class are as follows: n_1 = 50, n_2 = 75, n_3 = 100, n_4 = 125, and n_5 = 150, for a total of
n = \sum_g n_g = 500 observations. A priori class probabilities are assumed to be equal.

We note that the correct parametric model for the simulated data has varying mean
vectors and varying covariance matrices, which was confirmed by our procedure.

Each of the 100 different samples of multivariate data was then analyzed using the
AICPARM program of Bozdogan (1983). The results of one such sample are given in Table
5.5 for clustering K = 5 simulated class types of different groups into k = 1, 2, 3, 4, and
5-Sample Clusters on all variables, and the corresponding AIC's and SC's are shown for each
of the clustering alternatives.

Looking at Table 5.5, we see that for this particular sample AIC picks k = 5 as
being the correct number of classes (submodel 52), and then among the k = 4-Sample
Clusters it picks alternative submodel 47; among the k = 3-Sample Clusters it picks
submodel 25; and finally among the k = 2-Sample Clusters it picks submodel 2 as the best
clustering alternative, respectively, in a hierarchical fashion. According to AIC, we never
cluster the five class types as one homogeneous group (submodel 1).

Looking  at  the  same  results  for  SC's  we  see  that  SC  incorrectly  picks  k  =  4-Sample 
Cluster  in  submodel  47  as  being  the  correct  structure,  demonstrating  a  tendency  toward 
underfitting  the  underlying  random  process. 
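
The gap between the two criteria can be checked by hand. Assuming, as in the note to Table 5.5, that both criteria share the same lack-of-fit term and differ only in the penalty (2m for AIC, m log_e(n) for SC), the following minimal sketch computes the parameter count m = kp + kp(p+1)/2 and the penalty gap SC - AIC = m(log_e(n) - 2). The function names are illustrative and are not part of AICPARM.

```python
import math


def n_params(k, p):
    """m = kp + kp(p+1)/2: k mean vectors plus k covariance matrices."""
    return k * p + k * p * (p + 1) // 2


def penalty_gap(m, n):
    """SC penalty minus AIC penalty: m*log(n) - 2*m."""
    return m * math.log(n) - 2 * m


n, p = 500, 4
for k in range(1, 6):
    m = n_params(k, p)
    # The gap grows linearly in m, so SC charges more for each extra cluster.
    print(k, m, round(penalty_gap(m, n), 3))
```

For k = 1 this gives m = 14 and a gap of about 59.0, which agrees with the difference between the SC and AIC entries of submodel 1 in Table 5.5 (11709.513672 - 11650.509766). Because the gap increases with the number of clusters, SC leans toward fewer clusters than AIC for n = 500.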

Among all the 100 different samples, we also selected the best clustering alternatives
for k = 1, 2, 3, 4, and 5-Sample Clusters on the basis of the minimum AIC and SC procedures.
The following is a list of these best clusters:


k = 1    (1,2,3,4,5)            100 times out of 100 samples = 100%
k = 2    (1,2,3,4) (5)           75 times out of 100 samples =  75%
         (1,4,5) (2,3)           25 times out of 100 samples =  25%
k = 3    (1,4) (2,3) (5)        100 times out of 100 samples = 100%
k = 4    (1) (4) (2,3) (5)      100 times out of 100 samples = 100%
k = 5    (1) (2) (3) (4) (5)    100 times out of 100 samples = 100%


This  result  is  shown  in  Figure  5.1  in  terms  of  a  decision  tree  which  is  the  structure 
of  our  optimal  decision  tree  classifier.  The  suboptimal  decision  tree  classifier  is  shown  in 
Figure  5.2,  and  the  tree  which  was  picked  incorrectly  by  SC  is  shown  in  Figure  5.3. 

Thus, in general, if we already know a priori what k should be, AIC and SC agree
over all the samples on which clustering is optimal. In our experiment, AIC always chose
the optimal value of k as five, the correct number of underlying heterogeneous normal
populations from which each of the samples was taken. However, SC picks five
populations only 99% of the time. It incorrectly picks k = 4 in one of the samples, the results of
which are given in Table 5.5 as we discussed above.
6.  CONCLUSIONS


The results of the above method clearly illustrate the flexibility of the minimum AIC
and SC procedures over classical hypothesis testing. We see that AIC and SC can
indeed identify the best clustering alternatives when we cluster samples into homogeneous
sets of samples under the best fitting model. We can detect the source of heterogeneity
without any lengthy calculations or subjectivity, and we can measure the amount of
homogeneity and heterogeneity in clustering samples. With this new approach it is now
possible to determine a priori whether we should use equal or varying covariance matrices in
the analysis of a data set. We can reduce the dimensionality of data sets, as shown on the
variety of rice data set, and we do not need to assume any arbitrary level of significance α
or resort to table look-up.

The  model  selection  by  AIC  and  SC  is  also  more  satisfying  since  all  the  possible 
clustering  alternatives  are  considered. 

Thus,  from  the  results  presented  in  this  paper,  we  see  that  both  AIC  and  SC  unify 
the  conventional  test  procedures  and  avoid  the  existing  ambiguities  inherent  in  these 
procedures.  They  avoid  any  restriction  on  K,  the  number  of  classes  or  groups,  and  p,  the 
number of variables. The use of AIC and SC shows how to combine the information in the
likelihood with an appropriate function of the parameters to obtain estimates of the
information provided by competing alternative models. Therefore, the definitions of AIC and
SC give a clear formulation of the principle of parsimony in statistical model building or
comparison, as we demonstrated by numerical examples.

In conclusion, the new approach presented in this paper will provide the researcher
with a concise, efficient, and more refined way of studying simultaneous comparative
inference  for  a  particular  multi-sample  data  set.  The  ability  of  AIC  and  SC  to  allow  the 
researcher  to  extract  global  information  from  the  results  of  fitting  several  models  is  a 
unique  characteristic  that  is  not  shared  by  the  conventional  procedures  nor  is  it  realized  by 
conventional  significance  tests. 

Therefore,  for  these  reasons  the  use  of  model-selection  criteria  is  recommended  in 
conjunction  with  Multi-Sample  Cluster  Analysis  (MSCA)  as  an  alternative  to  Multiple 
Comparison  Procedures  (MCP's). 


APPENDIX  :  COMBINATORIAL  SUBROUTINES 


Here  we  give  a  listing  of  major  combinatorial  subroutines  which  we  implemented  in  a 
newly  developed  statistical  computer  software  by  this  author  called  AICPARM:  A  General 
Purpose  Program  for  Computing  AIC's  for  Univariate  and  Multivariate  Normal  Parametric 
Models.  For  a  lucid  discussion  and  details  on  combinatorial  algorithms,  we  refer  the  reader 
to  Nijenhuis  and  Wilf  (1978). 

A.1  MCP:  Combination  of  K  Samples  Taken  k  at  a  Time  for  MCP's  in  Lexicographic  Order

This subroutine generates and lists different combinations of K groups or samples
taken k at a time sequentially. There are \binom{K}{k} k-subsets of a K-set altogether, and MCP
is a simple algorithm which constructs all possible alternatives in "lexicographic", that
is, in "alphabetical" order. A listing of the output from this program is shown in Table
3.1.
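
As an illustration of the lexicographic ordering described above, here is a minimal Python sketch of a successor rule for k-subsets. It is not a transcription of the NEXKSB routine; the function name `k_subsets` is hypothetical, and the samples are assumed to be labelled 1, ..., K.

```python
def k_subsets(K, k):
    """Yield all k-subsets of {1, ..., K} in lexicographic order."""
    a = list(range(1, k + 1))          # first subset: (1, 2, ..., k)
    while True:
        yield tuple(a)
        # Find the rightmost position that has not reached its maximum.
        i = k - 1
        while i >= 0 and a[i] == K - k + i + 1:
            i -= 1
        if i < 0:                      # (K-k+1, ..., K) was the last subset
            return
        a[i] += 1
        for j in range(i + 1, k):      # reset the tail to the smallest values
            a[j] = a[j - 1] + 1


for subset in k_subsets(4, 2):
    print(subset)
```

For K = 4 and k = 2 the sketch lists the six pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), that is, the pairwise comparisons of four samples.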


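PROGRAM  MCP

      PROGRAM MCP
C     Lists all k-subsets of a K-set in lexicographic order.
      INTEGER A(100),N,K,I
      LOGICAL MTC
      MTC = .FALSE.
      PRINT *,'K CHOOSE k'
      PRINT *,'WHAT IS K?'
      READ *,N
      PRINT *,'WHAT IS k?'
      READ *,K
   10 CONTINUE
      CALL NEXKSB(N,K,A,MTC)
      PRINT 2,(A(I),I=1,K)
    2 FORMAT(30(1X,I1))
      IF(MTC) GOTO 10
      STOP
      END

      SUBROUTINE NEXKSB(N,K,A,MTC)
C     Returns in A the next k-subset of (1,...,N).  On the first
C     call MTC must be .FALSE.; on return MTC is .TRUE. while more
C     subsets remain.  H and M2 are kept between calls.
      INTEGER N,K,A(K)
      LOGICAL MTC
      INTEGER H,M2
      SAVE H,M2
   30 IF(MTC) GOTO 40
   20 M2=0
      H=K
      GOTO 50
   40 IF(M2.LT.N-H) H=0
      H=H+1
      M2=A(K+1-H)
   50 DO 51 J=1,H
   51 A(K+J-H)=M2+J
      MTC=A(1).NE.N-K+1
      RETURN
      END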


A.2  STIRN2:  Stirling  Number  of  the  Second  Kind 


This subroutine constructs a table of the total number of clustering alternatives for
various values of K, the number of samples, and k, the number of clusters of samples. A
listing of the output from this program is shown in Table 3.2.
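
The table rests on the standard recurrence S(K,k) = k S(K-1,k) + S(K-1,k-1) for Stirling numbers of the second kind. A minimal Python sketch of that recurrence follows; the function name `stirling2_table` and the convention S(0,0) = 1 are choices made here for illustration, and Python's unbounded integers stand in for the floating-point table of the Fortran listing.

```python
def stirling2_table(max_n):
    """Return S with S[n][k] = Stirling number of the second kind."""
    S = [[0] * (max_n + 1) for _ in range(max_n + 1)]
    S[0][0] = 1                        # convention: one partition of the empty set
    for n in range(1, max_n + 1):
        for k in range(1, n + 1):
            # Sample n either joins one of k existing clusters
            # or forms a new cluster on its own.
            S[n][k] = k * S[n - 1][k] + S[n - 1][k - 1]
    return S


S = stirling2_table(5)
for n in range(1, 6):
    print(n, sum(S[n]), S[n][1:n + 1])
```

The row for K = 5 reads 1, 15, 25, 10, 1 with total 52, which matches the 52 submodels considered for the five simulated classes.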


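PROGRAM  STIRN2

      PROGRAM STIRN2
C     Tabulates Stirling numbers of the second kind S(N,K) and
C     their row totals T for N = 1,...,20.  The values are held in
C     DOUBLE PRECISION and printed with F edit descriptors, since
C     the totals exceed the range of a default INTEGER.
      DOUBLE PRECISION S(20,20),T
      INTEGER N,K,I
C     Zero the table so that S(N-1,N) = 0 in the recurrence.
      DO 5 I=1,20
      DO 4 K=1,20
      S(I,K)=0.D0
    4 CONTINUE
      S(I,1)=1.D0
    5 CONTINUE
      PRINT 30,'TOTAL',(I,I=1,20)
   30 FORMAT(14X,A,6(I15,1X),3(:/T20,6(I15,1X)))
   40 FORMAT(I2,1X,7(F15.0,1X),3(:/T20,6(F15.0,1X)))
      PRINT 40,1,S(1,1),S(1,1)
      DO 20 N=2,20
      T=1.D0
      DO 10 K=2,N
      S(N,K)=K*S(N-1,K)+S(N-1,K-1)
      T=T+S(N,K)
   10 CONTINUE
      PRINT 40,N,T,(S(N,I),I=1,N)
   20 CONTINUE
      STOP
      END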


A.3  REPFM:  Representation  Forms  of  Clustering  Alternatives 


Clustering  alternatives  can  be  classified  according  to  their  representation  forms  to 
make  it  easy  to  list  all  possible  clustering  alternatives.  The  subroutine  REPFM  gives  the 
partition  of  K  (number  of  samples)  which  is  a  positive  integer,  into  a  specified  k  number  of 
parts.  For  example,  the  representation  forms  of  K=6  samples  into  k=3  parts  are: 

6 = (4) + (1) + (1)
  = (3) + (2) + (1)
  = (2) + (2) + (2).
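
The example above can be reproduced with a short recursive sketch that lists the partitions of K into exactly k parts, with the parts in non-increasing order. This is an illustrative alternative to the NEXPAR-based listing, not a transcription of it, and the function name is hypothetical.

```python
def partitions_into_parts(K, k, largest=None):
    """Yield the partitions of K into exactly k parts, largest part first."""
    if largest is None:
        largest = K
    if k == 0:
        if K == 0:
            yield ()
        return
    hi = min(largest, K - k + 1)       # leave at least 1 for each remaining part
    lo = -(-K // k)                    # ceil(K / k): the first part is the largest
    for first in range(hi, lo - 1, -1):
        for rest in partitions_into_parts(K - first, k - 1, first):
            yield (first,) + rest


for form in partitions_into_parts(6, 3):
    print(" + ".join("(%d)" % part for part in form))
```

Each tuple is one representation form: for K = 6 and k = 3 the sketch yields (4,1,1), (3,2,1), and (2,2,2), in the same order as the display above.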


PROGRAM  REPFM

      PROGRAM REPFM
C     Lists the partitions of the integer N, one per line.
      INTEGER N,D,I,K
      INTEGER R(100),M(100)
      LOGICAL MTC,FIRST
      EXTERNAL NEXPAR
      MTC = .TRUE.
      FIRST = .TRUE.
      PRINT *,'WHAT IS N?'
      READ *,N
   10 CONTINUE
      IF(MTC) CALL NEXPAR(N,R,M,D,MTC,FIRST)
C     Part R(I) occurs M(I) times in the current partition.
      PRINT 2,((R(I),K=1,M(I)),I=1,D)
    2 FORMAT(30(1X,I1))
      IF(MTC) GOTO 10
      STOP
      END

      SUBROUTINE NEXPAR(N,R,M,D,MTC,FIRST)
C     Returns the next partition of N as D distinct parts R(1..D)
C     with multiplicities M(1..D).  MTC is .FALSE. on return once
C     the last partition (all parts equal to 1) has been produced.
      INTEGER N,M,R,S,D,SUM,F
      LOGICAL MTC,FIRST
      DIMENSION R(N),M(N)
      INTRINSIC MOD
      IF(.NOT.FIRST) GOTO 20
      FIRST = .FALSE.
   30 S=N
      D=0
   50 D=D+1
      R(D)=S
      M(D)=1
   40 MTC=M(D).NE.N
      RETURN
   20 IF(.NOT.MTC) GOTO 30
      SUM=1
      IF(R(D).GT.1) GOTO 60
      SUM=M(D)+1
      D=D-1
   60 F=R(D)-1
      IF(M(D).EQ.1) GOTO 70
      M(D)=M(D)-1
      D=D+1
   70 R(D)=F
      M(D)=1+SUM/F
      S=MOD(SUM,F)
      IF(S) 40,40,50
      END


A.4  ALLSUB:  All  Possible  Partitioning  of  K-Samples  into  k-Sample  Clusters 
This subroutine generates and lists all the simple patterns of clustering alternatives
for a specified number of samples K for Multi-Sample Cluster Analysis. A listing of the
output from this program is shown in Table 3.3.
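
To make the enumeration concrete, the following Python sketch lists every partition of K samples into clusters by assigning the samples one at a time, either to an existing cluster or to a new one. It is an illustrative sketch rather than the ALLSUB/NEXEQU code, its output order need not match Table 3.3, and the function name `set_partitions` is hypothetical.

```python
from collections import Counter


def set_partitions(K):
    """Yield every partition of {1, ..., K} as a list of clusters (tuples)."""
    def extend(i, clusters):
        if i > K:
            yield [tuple(c) for c in clusters]
            return
        for c in clusters:             # put sample i into an existing cluster
            c.append(i)
            yield from extend(i + 1, clusters)
            c.pop()
        clusters.append([i])           # or open a new cluster with sample i
        yield from extend(i + 1, clusters)
        clusters.pop()

    yield from extend(1, [])


alternatives = list(set_partitions(5))
by_k = Counter(len(partition) for partition in alternatives)
print(len(alternatives), sorted(by_k.items()))
```

For K = 5 the sketch produces 52 alternatives, split as 1, 15, 25, 10, and 1 across k = 1, ..., 5 clusters, in agreement with the Stirling numbers of the second kind.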


PROGRAM  ALLSUB

      PROGRAM ALLSUB
C     Lists all clustering alternatives (set partitions) of N groups.
      INTEGER N,NC
      INTEGER P(100),Q(100)
      CHARACTER*80 LIST
      EXTERNAL NEXEQU
      LOGICAL MTC
      PRINT *,'HOW MANY GROUPS?'
      READ *,N
      MTC = .FALSE.
   10 CALL NEXEQU(N,NC,P,Q,MTC)
      CALL NEXLST(N,P,Q,NC)
      IF(MTC) GOTO 10
      END

      SUBROUTINE NEXEQU(N,NC,P,Q,MTC)
C     NC is the number of clusters, P(L) the size of cluster L,
C     and Q(I) the cluster to which sample I belongs.
      INTEGER N,NC
      INTEGER P(N),Q(N)
      LOGICAL MTC
      SAVE
      IF(MTC) GOTO 20
   10 NC=1
      DO 11 I=1,N
   11 Q(I)=1
      P(1)=N
   60 MTC=NC.NE.N
      RETURN
   20 M=N
   30 L=Q(M)
      IF(P(L).NE.1) GOTO 40
      Q(M)=1
      M=M-1


Table  3.1


A  Simple  Pattern  of  Clustering  Alternatives  of  Multiple  Comparison 
for  Different  Combinations  of  K  Samples  Taken  k  at  a  Time 


Clustering Alternatives
Combinations:  2-Subsets  3-Subsets  4-Subsets

(1,2,3)

(1,2) (3)
(1,3) (2)
(2,3) (1)

(1) (2) (3)


(1,2,3,4)

(2,3,4) (1)
(1,3,4) (2)
(1,2,4) (3)
(1,2,3) (4)
(1,4) (2,3)
(1,3) (2,4)
(1,2) (3,4)

(3,4) (1) (2)
(2,4) (1) (3)
(2,3) (1) (4)
(1,4) (2) (3)
(1,3) (2) (4)
(1,2) (3) (4)

(1) (2) (3) (4)


Table 5.1  Pairwise Comparisons of Four Varieties of Rice on All Variables


Table 5.2  Multi-Sample Cluster Analysis of K = 4 Varieties of Rice into
k = 1, 2, 3, and 4-Sample Clusters,
The AIC's and SC's on All Variables

Alternative  Clustering         k    m    AIC({\mu_g, \Sigma})   SC({\mu_g, \Sigma})

     1       (1,2,3,4)          1    5       185.440125*           190.418762*
     2       (2,3,4) (1)        2    7       180.937897            187.907990
     3       (1,3,4) (2)        2    7       183.446991            190.417084
     4       (1,2,4) (3)        2    7       187.684692            194.654785
     5       (1,2,3) (4)        2    7       183.596893            190.566986
     6       (1,4) (2,3)        2    7       185.540253            192.510345
     7       (1,3) (2,4)        2    7       175.777649*           182.747742*
     8       (1,2) (3,4)        2    7       188.730988            195.701080
     9       (3,4) (1) (2)      3    9       181.179932            190.141510
    10       (2,4) (1) (3)      3    9       177.366486            186.328064
    11       (2,3) (1) (4)      3    9       180.485565            189.447144
    12       (1,4) (2) (3)      3    9       185.593323            194.554901
    13       (1,3) (2) (4)      3    9       176.836731*           185.798309*
    14       (1,2) (3) (4)      3    9       187.137421            196.098999
    15       (1) (2) (3) (4)    4   11       178.289734*           189.242767*

NOTE: A = 1, B = 2, C = 3, and D = 4; n = 20 observations;
p = 2 variables; m = kp + p(p+1)/2 parameters;

AIC(\{\mu_g, \Sigma\}) = np \log_e(2\pi) + n \log_e |n^{-1} W| + np + 2m

SC(\{\mu_g, \Sigma\}) = np \log_e(2\pi) + n \log_e |n^{-1} W| + np + m \log_e(n)

* Minimum AIC's and SC's for k = 1, 2, 3, and 4-sample clusters, respectively.
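
The minimum-criterion selection in Table 5.2 can be reproduced mechanically. The sketch below is a minimal illustration, assuming the AIC and SC values are transcribed from Table 5.2 with the alternatives numbered 1 to 15 in table order; the helper `best_per_k` is a hypothetical name and not part of AICPARM.

```python
# (alternative, clustering, k, AIC, SC), transcribed from Table 5.2
TABLE_5_2 = [
    (1, "(1,2,3,4)", 1, 185.440125, 190.418762),
    (2, "(2,3,4) (1)", 2, 180.937897, 187.907990),
    (3, "(1,3,4) (2)", 2, 183.446991, 190.417084),
    (4, "(1,2,4) (3)", 2, 187.684692, 194.654785),
    (5, "(1,2,3) (4)", 2, 183.596893, 190.566986),
    (6, "(1,4) (2,3)", 2, 185.540253, 192.510345),
    (7, "(1,3) (2,4)", 2, 175.777649, 182.747742),
    (8, "(1,2) (3,4)", 2, 188.730988, 195.701080),
    (9, "(3,4) (1) (2)", 3, 181.179932, 190.141510),
    (10, "(2,4) (1) (3)", 3, 177.366486, 186.328064),
    (11, "(2,3) (1) (4)", 3, 180.485565, 189.447144),
    (12, "(1,4) (2) (3)", 3, 185.593323, 194.554901),
    (13, "(1,3) (2) (4)", 3, 176.836731, 185.798309),
    (14, "(1,2) (3) (4)", 3, 187.137421, 196.098999),
    (15, "(1) (2) (3) (4)", 4, 178.289734, 189.242767),
]
AIC, SC = 3, 4                         # column positions of the two criteria


def best_per_k(rows, col):
    """Return {k: row} holding, for each k, the row minimizing column col."""
    best = {}
    for row in rows:
        k = row[2]
        if k not in best or row[col] < best[k][col]:
            best[k] = row
    return best


for name, col in (("AIC", AIC), ("SC", SC)):
    per_k = best_per_k(TABLE_5_2, col)
    overall = min(TABLE_5_2, key=lambda r: r[col])
    print(name, {k: r[0] for k, r in per_k.items()}, "overall:", overall[1])
```

Under both criteria the best alternatives for k = 1, 2, 3, 4 are numbers 1, 7, 13, and 15, and the overall minimum is attained by (1,3) (2,4).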




Table 5.3  Univariate AIC's on p = 2 Variables for Four Varieties of Rice

Variables                           AIC({\mu_g, \sigma^2})

1. Height of Plant                       111.70*
2. Number of Tillers Per Plant            65.94

NOTE: AIC(\{\mu_g, \sigma^2\}) = n \log_e(2\pi) + n \log_e(n^{-1} W) + n + 2(k+1)

* indicates that there is a difference in heights between the varieties.


Table 5.4  The Class Statistics of Landsat-2
Multispectral Scanner (MSS) Signatures

Class Type     Channel     Mean Vector     Covariance Matrix

Table 5.5  Multi-Sample Cluster Analysis of K = 5 Simulated Class Types of
Different Crops into k = 1, 2, 3, 4, and 5-Sample Clusters,
The AIC's and SC's on All Variables

Alternative  Clustering             k    m    AIC({\mu_g, \Sigma_g})   SC({\mu_g, \Sigma_g})

     1       (1,2,3,4,5)            1   14      11650.509766*          11709.513672*
     2       (1,2,3,4) (5)          2   28      10434.767578*          10552.775391*
     3       (1,2,3,5) (4)          2   28      11100.183594           11218.191406
     4       (1,2,4,5) (3)          2   28      11102.822266           11220.830078
     5       (1,3,4,5) (2)          2   28      11223.681641           11341.689453
     6       (1) (2,3,4,5)          2   28      10954.841797           11072.849609
     7       (1,2,3) (4,5)          2   28      10753.361328           10871.369141
     8       (1,2,4) (3,5)          2   28      11195.396484           11313.404297
     9       (1,2,5) (3,4)          2   28      11245.294922           11363.302734
    10       (1,3,4) (2,5)          2   28      11121.404297           11239.412109
    11       (1,3,5) (2,4)          2   28      11340.630859           11458.638672
    12       (1,4,5) (2,3)          2   28      10475.867188           10593.875000
    13       (1,5) (2,3,4)          2   28      10612.294922           10730.302734
    14       (1,4) (2,3,5)          2   28      10763.261719           10881.269531
    15       (1,3) (2,4,5)          2   28      11045.414063           11163.421875
    16       (1,2) (3,4,5)          2   28      11122.408203           11240.416016
    17       (1) (2,5) (3,4)        3   42      10669.160156           10846.171875
    18       (1) (2,4) (3,5)        3   42      10737.947266           10914.958984
    19       (1) (2,3) (4,5)        3   42       9931.666016           10108.677734
    20       (1,5) (2) (3,4)        3   42      10321.367188           10498.378906
    21       (1,5) (2,4) (3)        3   42      10301.919922           10478.931641
    22       (1,5) (2,3) (4)        3   42       9684.330078            9861.341797
    23       (1,4) (2) (3,5)        3   42      10377.035156           10554.046875
    24       (1,4) (2,5) (3)        3   42      10288.802734           10465.814453
    25       (1,4) (2,3) (5)        3   42       9378.460938*           9455.472656*
    26       (1,3) (2) (4,5)        3   42      10464.267578           10641.279297
    27       (1,3) (2,5) (4)        3   42      10564.726563           10741.738281
    28       (1,3) (2,4) (5)        3   42      10171.974609           10348.986328
    29       (1,2) (4,5) (3)        3   42      10418.652344           10595.664063
    30       (1,2) (3,5) (4)        3   42      10607.343750           10784.355469
    31       (1,2) (3,4) (5)        3   42      10322.818359
    32       (1,2,3) (4) (5)        3   42       9965.222656
    33       (1,2,4) (3) (5)        3   42      10218.564453
    34       (1,2,5) (3) (4)        3   42      10730.000000
    35       (1,3,4) (2) (5)        3   42      10232.804688
    36       (1,3,5) (2) (4)        3   42      10844.783203
    37       (1,4,5) (2) (3)        3   42      10597.609375
    38       (1) (2,3,4) (5)        3   42      10071.490234
    39       (1) (2,3,5) (4)        3   42      10628.326172
    40       (1) (2,4,5) (3)        3   42      10634.554688
    41       (1) (2) (3,4,5)        3   42      10757.164063
    42       (1) (2) (3) (4,5)      4   56
    43       (1) (2) (3,5) (4)      4   56
    44       (1) (2) (3,4) (5)      4   56
    45       (1) (2,5) (3) (4)      4   56
    46       (1) (2,4) (3) (5)      4   56
    47       (1) (2,3) (4) (5)      4   56
    48       (1,5) (2) (3) (4)      4   56
    49       (1,4) (2) (3) (5)      4   56
    50       (1,3) (2) (4) (5)      4   56
    51       (1,2) (3) (4) (5)      4   56
    52       (1) (2) (3) (4) (5)    5   70

NOTE: n = 500 total number of observations; p = 4 variables;
m = kp + kp(p+1)/2 parameters;

AIC(\{\mu_g, \Sigma_g\}) = np \log_e(2\pi) + \sum_{g=1}^{k} n_g \log_e |n_g^{-1} A_g| + np + 2m

SC(\{\mu_g, \Sigma_g\}) = np \log_e(2\pi) + \sum_{g=1}^{k} n_g \log_e |n_g^{-1} A_g| + np + m \log_e(n)

* Minimum AIC's and SC's for k = 1, 2, 3, 4, and 5-sample clusters, respectively.


Figure 5.1. OPTIMAL DECISION TREE CLASSIFIER. This tree structure associated with
the class statistics given in Table 5.4 was picked 75 times in the 100 different samples
using the minimum AIC procedure, and it was picked 74 times by using the minimum SC
procedure.


Figure 5.2. SUBOPTIMAL DECISION TREE CLASSIFIER. This tree structure associated
with the class statistics given in Table 5.4 was picked 25 times in the 100 different
samples using the minimum AIC and SC procedures.


Figure 5.3. WRONG DECISION TREE CLASSIFIER. This is the tree structure which was
picked by the minimum SC procedure wrongfully, that is, SC missed the correct structure.

